Patent abstract:
A processor comprising: an execution pipeline, multiple sets of context registers, and a scheduler arranged to control the pipeline to provide a repeating sequence of temporally interleaved time slots, thereby enabling at least one respective worker thread to be allocated for execution in each respective slot of some or all of the time slots, wherein the program state of the respective worker thread executing in each time slot is maintained in a respective one of the sets of context registers; and an exit state register arranged to store an aggregated exit state of the worker threads. The instruction set includes an exit instruction to be included in each worker thread, the exit instruction taking an individual exit state of the respective thread as an operand. The exit instruction terminates the respective thread and also causes the individual exit state specified in the operand to contribute to the aggregated exit state.
Publication number: FR3072799A1
Application number: FR1859635
Filing date: 2018-10-18
Publication date: 2019-04-26
Inventor: Simon Christian Knowles
Applicant: Graphcore Ltd
IPC main class:
Patent description:

DESCRIPTION
TITLE: COMBINING STATES OF MULTIPLE THREADS IN A MULTI-THREADED PROCESSOR
Technical Field [0001] The present disclosure relates to a multi-threaded processor comprising hardware support for executing multiple threads in an interleaved manner. In particular, the present disclosure relates to the aggregation of states produced by such threads upon their termination, for example to represent the aggregated state of a plurality of nodes of a graph in an artificial intelligence algorithm.
BACKGROUND ART A multi-threaded processor is a processor capable of executing multiple program threads alongside one another. The processor may comprise some hardware that is common to the multiple different threads (e.g. an instruction memory, a data memory, and/or a common execution pipeline); but to support multi-threading, the processor also comprises some dedicated hardware specific to each thread.
The dedicated hardware comprises at least one bank of respective context registers for each of the multiple threads that can be executed at once. A context, when talking about multi-threaded processors, refers to the program state of a respective one of the threads being executed concurrently (for example the program counter value, status, and current operand values). The context register bank refers to the respective set of registers for representing this program state of the respective thread. The registers in a register bank are distinct from general-purpose memory in that the addresses of the
B17780 FR -408528FR registers are fixed as bits in instruction words, whereas memory addresses can be computed by executing instructions. The registers of a given context typically comprise a respective program counter for the respective thread, and a respective set of operand registers for temporarily holding the data acted upon and output by the respective thread during the computations performed by that thread. Each context may also have a respective status register for storing a status of the respective thread (e.g. whether it is paused or running). Thus each of the currently running threads has its own separate program counter, and optionally operand registers and one or more status registers.
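The per-thread state just described can be pictured as a small record of registers. The following is a minimal software sketch of one such context register bank; the field names and register counts are illustrative assumptions, not taken from the patent:

```python
from dataclasses import dataclass, field

@dataclass
class ThreadContext:
    """Illustrative context register bank for one thread (names are hypothetical)."""
    pc: int = 0              # program counter: address of the next instruction
    status: str = "paused"   # status register: e.g. "running" or "paused"
    # operand registers, temporarily holding values the thread computes with:
    operands: list = field(default_factory=lambda: [0] * 8)

# Each concurrently executable thread owns its own separate context:
contexts = [ThreadContext() for _ in range(4)]
contexts[0].status = "running"
contexts[0].pc = 0x100
```

The point of the sketch is simply that the contexts are fully independent records: updating one thread's program counter or status leaves the others untouched.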
One possible form of multi-threading is parallelism. That is, as well as multiple contexts, multiple execution pipelines are provided: i.e. a separate execution pipeline for each stream of instructions to be executed in parallel. However, this requires a great deal of duplication in terms of hardware.
Therefore, instead, another form of multi-threaded processor employs concurrency rather than parallelism, whereby the threads share a common execution pipeline (or at least a common part of a pipeline) and different threads are interleaved through this same shared execution pipeline. The performance of a multi-threaded processor can still be improved compared to no concurrency or parallelism, thanks to increased opportunities for hiding pipeline latency. Also, this approach does not require as much extra hardware dedicated to each thread as a fully parallel processor with multiple execution pipelines, and so does not require as much extra silicon.
A form of parallelism can be achieved by means of a processor comprising an arrangement of multiple tiles on the same chip (i.e. the same die), each tile separately comprising its own processing unit and memory (including program memory and data memory). Thus separate portions of program code can be run in parallel on different tiles. The tiles are connected together via an on-chip interconnect which enables the code run on the different tiles to communicate between tiles. In some cases the processing unit on each tile may itself run multiple concurrent threads on-tile, each tile having its own respective set of contexts and corresponding pipeline as described above, in order to support interleaving of multiple threads on the same tile through the same pipeline.
An example of the use of multi-threaded and/or multi-tile processing is found in artificial intelligence. As will be familiar to those skilled in the art of artificial intelligence, an artificial intelligence algorithm is based on performing iterative updates to a knowledge model, which can be represented by a graph of multiple interconnected nodes. Each node represents a function of its inputs. Some nodes receive the inputs to the graph and some receive inputs from one or more other nodes, whilst the output of some nodes forms the inputs of other nodes, and the output of some nodes provides the output of the graph (and in some cases a given node may even have all of these: inputs to the graph, outputs from the graph and connections to other nodes). Further, the function at each node is parameterized by one or more respective parameters, i.e. weights. During a learning stage the aim is, based on a set of experiential input data, to find values for the various parameters such that the graph as a whole will generate a desired output for a range of possible inputs. Various algorithms for doing this are known in the art, such as a back propagation algorithm based on stochastic gradient descent. Over multiple iterations based on the input data, the parameters are gradually tuned to decrease their errors, and thus the graph converges toward a solution. In a subsequent stage, the learned model can then be used to make predictions of outputs given a specified set of inputs, or to make inferences as to inputs (causes) given a specified set of outputs.
The implementation of each node will involve the processing of data, and the interconnections of the graph correspond to data to be exchanged between the nodes. Typically, at least some of the processing of each node can be carried out independently of some or all of the other nodes in the graph, and therefore large graphs expose great opportunities for concurrency and/or parallelism.
Summary of the invention
The following describes components of a processor having an architecture which has been developed to address issues arising in the computations involved in artificial intelligence applications. The processor described herein may be used as a work accelerator: that is, it receives a workload from an application running on a host computer, the workload generally being in the form of very large data sets to be processed (such as the large experience data sets used by an artificial intelligence algorithm to learn a knowledge model, or the data from which to perform a prediction or inference using a previously learned knowledge model). An aim of the architecture presented herein is to process these very large amounts of data highly efficiently. The processor architecture has been developed for processing workloads involved in artificial intelligence. Nonetheless, it will be apparent that the disclosed architecture may also be suitable for other workloads sharing similar characteristics.
When multiple threads are executed through a multi-threaded processing unit, it may be required to determine a state of the program as a whole once all the desired threads have completed their respective task or tasks, for example to determine whether or not an exception should be reported to the host, or to make a branch decision as to whether to branch to a next part of the program or to continue iterating the current part. For instance, if each of a plurality of threads represents a respective node in an artificial intelligence graph or subgraph, it may be desired for a supervisory portion of the program to determine whether all the worker threads have satisfied a certain condition indicating that the graph is converging toward a solution. Making such a determination using existing techniques requires a number of programmed steps using general purpose instructions.
[0011] It is recognized herein that it would be desirable to tailor the instruction set of a processor to applications with large-scale multi-threading, such as machine learning. According to the present disclosure this is achieved by providing a dedicated machine code instruction by which a worker thread terminates itself and at the same time causes an exit state of that thread to contribute to an overall exit state for multiple threads, thus making it possible to determine an overall outcome of multiple threads with reduced computational load, faster execution time and lower code density.
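The behaviour of such a dedicated exit instruction can be sketched in software. The following is a simulation sketch only, not the actual hardware; the class and method names are our own, and the AND-reduction shown is just one of the aggregation options discussed below:

```python
class Tile:
    """Toy model of one multi-threaded processing unit with an exit state register."""
    def __init__(self):
        self.aggregated_exit_state = 1   # exit state register, seeded 1 for AND-aggregation
        self.running_threads = set()

    def launch(self, thread_id):
        self.running_threads.add(thread_id)

    def exit_instruction(self, thread_id, individual_exit_state):
        """One instruction, two effects: terminate the thread AND fold its
        individual exit state into the aggregate."""
        self.running_threads.discard(thread_id)               # thread terminates itself
        self.aggregated_exit_state &= individual_exit_state   # contribute to aggregate

tile = Tile()
for wid in range(3):
    tile.launch(wid)
tile.exit_instruction(0, 1)
tile.exit_instruction(1, 0)   # one worker reports e.g. non-convergence
tile.exit_instruction(2, 1)
# aggregated_exit_state is now 0: not all workers exited with state 1
```

In hardware the two effects happen in response to a single operation code, which is what saves the programmed steps a general-purpose implementation would need.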
According to one aspect described here, a processor is provided comprising:
an execution pipeline comprising an execution unit for executing machine code instructions, each being an instance of a predefined set of instruction types in an instruction set of the processor, each instruction type in the instruction set being defined by a corresponding operation code and zero or more operand fields for taking zero or more operands;
multiple sets of context registers;
a scheduler arranged to control the execution pipeline to provide a repeating sequence of temporally interleaved time slots, thereby enabling at least one respective worker thread to be allocated for execution in each respective slot of some or all of the time slots, wherein the program state of the respective worker thread executing in each time slot is maintained in a respective one of the sets of context registers; and an exit state register arranged to store an aggregated exit state of the worker threads;
wherein the instruction set includes an exit instruction to be included in each of the worker threads, the exit instruction taking at least an individual exit state of the respective thread as an operand; and wherein the execution unit comprises dedicated hardware logic arranged, in response to the operation code of the exit instruction, to terminate the execution of the respective worker thread in its respective time slot, and also to cause the individual exit state specified in the operand to contribute to the aggregated exit state in the exit state register.
In embodiments, the exit instruction may include a single operand field taking a single operand in the form of the individual exit state.
In embodiments, each of the individual exit states and the aggregated exit state may be a single bit.
In embodiments, the aggregation may consist of a Boolean AND of the individual exit states, or a Boolean OR of the individual exit states.
In embodiments, the aggregated exit state may comprise at least two bits representing a trinary value, indicating whether the individual binary exit states are all 1, all 0, or mixed.
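A two-bit encoding of this trinary aggregate can be sketched as follows. The encoding values and function name are our own illustrative choices; real hardware would also update the aggregate incrementally as each thread exits, rather than reducing over a completed list:

```python
ALL_ZERO, ALL_ONE, MIXED = 0b00, 0b01, 0b10  # hypothetical 2-bit encodings

def aggregate_trinary(exit_states):
    """Reduce individual 1-bit exit states to all-1 / all-0 / mixed."""
    if all(s == 1 for s in exit_states):
        return ALL_ONE
    if all(s == 0 for s in exit_states):
        return ALL_ZERO
    return MIXED

# e.g. aggregate_trinary([1, 1, 1]) returns ALL_ONE,
#      aggregate_trinary([1, 0, 1]) returns MIXED
```

The trinary form preserves strictly more information than a single AND or OR bit: a supervisor can distinguish "every node converged" from "no node converged" from "some did, some did not".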
In embodiments, the processor may further be arranged to execute, in one or more of the time slots, during at least some repetitions of said sequence of time slots, a supervisor thread which allocates the worker threads to the respective execution slots.
In embodiments, the multiple sets of context registers may comprise multiple sets of worker context registers, each set of worker context registers being arranged to maintain the program state of the respective worker thread executing in the respective time slot whilst that worker thread is executing, and an additional set of supervisor context registers comprising additional registers arranged to store a program state of the supervisor thread.
In embodiments, the supervisor thread may begin by executing in each of the plurality of time slots, and then relinquish some or all of the time slots to the respective worker threads; and the exit instruction may cause the supervisor thread to resume execution in the respective time slot of the worker thread which executed the exit instruction.
In embodiments, the instruction set may further comprise a relinquish instruction, and the execution stage may be arranged to effect the relinquishing of the respective execution slot in response to the operation code of the relinquish instruction being executed by the supervisor thread in the respective time slot being relinquished.
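The slot hand-over implied by the relinquish and exit instructions can be sketched as a simple ownership state machine. This is a software analogy only; the class and method names are illustrative assumptions:

```python
class TimeSlot:
    """One interleaved time slot, initially owned by the supervisor thread."""
    def __init__(self):
        self.owner = "supervisor"

    def relinquish(self, worker_id):
        # Supervisor executes the relinquish instruction in this slot,
        # handing the slot over to a worker thread.
        assert self.owner == "supervisor"
        self.owner = worker_id

    def exit(self):
        # Worker executes the exit instruction: it terminates and the
        # supervisor resumes execution in this same slot.
        assert self.owner != "supervisor"
        self.owner = "supervisor"

slots = [TimeSlot() for _ in range(4)]
slots[0].relinquish("worker-A")
slots[0].exit()
# slot 0 is back with the supervisor; slots 1 to 3 never left it
```

The asymmetry matters: relinquish is a supervisor-side instruction, while exit is worker-side, so the pair together close the loop on each slot's ownership.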
In embodiments, the processor may comprise an array of tiles, each of which comprises an instance of the execution stage, the multiple contexts, the scheduler and the exit state register; and the processor may further comprise an interconnect for communicating between the tiles.
In embodiments, the interconnect may comprise dedicated hardware logic arranged to automatically aggregate the aggregated exit states from the array of tiles into a global aggregate, and to make the global aggregate available to at least one of the threads on each of the tiles.
In embodiments, said at least one thread may comprise the supervisor thread.
In embodiments, each of the tiles may further comprise a global aggregate register arranged to be readable by said at least one thread on that tile; and the logic in the interconnect may be arranged to automatically make the global aggregate available to said at least one thread on each tile by automatically storing the global aggregate in the global aggregate register on each tile.
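The resulting two-level aggregation, per tile and then across the tile array via the interconnect, can be sketched as follows, assuming AND-reduction at both levels (the function and variable names are our own):

```python
def aggregate_global(per_tile_exit_states):
    """Software model of the interconnect logic: AND together the per-tile
    aggregated exit states into a single global aggregate."""
    global_aggregate = 1
    for tile_state in per_tile_exit_states:
        global_aggregate &= tile_state
    return global_aggregate

# Each tile has already AND-reduced its own workers' exit states:
tile_aggregates = [1, 1, 0, 1]          # one tile reports a non-1 aggregate
global_agg = aggregate_global(tile_aggregates)

# The interconnect then writes the global aggregate back into a global
# aggregate register on every tile, readable by e.g. the supervisor:
global_registers = {tile: global_agg for tile in range(len(tile_aggregates))}
```

Because the reduction is done in dedicated interconnect hardware and mirrored into a register on every tile, each supervisor can read the array-wide outcome with a single register read instead of a message exchange.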
In embodiments, the interconnect may comprise a hardware synchronization controller operable to apply a bulk synchronous parallel (BSP) exchange scheme to the communications between tiles, whereby, when each of the tiles is programmed to perform an inter-tile exchange phase and an on-tile compute phase, either a) the exchange phase is held back until all the worker threads on all the tiles in the array have completed the compute phase, or b) the compute phase is held back until all the tiles in the array have completed the exchange phase.
In embodiments, the instruction set may further comprise a barrier synchronization instruction to be included in one of the threads on each of the tiles following (a) the compute phase or (b) the exchange phase, respectively;
on each of the tiles, the execution stage may be arranged to, upon execution of the barrier synchronization instruction, send a synchronization request to the synchronization controller in the interconnect; and the synchronization controller may be arranged to return a synchronization acknowledgment signal to each of the tiles in response to receiving an instance of the synchronization request from all of the tiles, the synchronization acknowledgment signal releasing the next (a) exchange phase or (b) compute phase, accordingly.
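The request/acknowledge handshake of this barrier can be sketched as follows. This is a software analogy of the hardware controller, with illustrative names; a real controller broadcasts the acknowledgment to all tiles simultaneously, which the return value here only approximates:

```python
class SyncController:
    """Toy model of the interconnect's barrier synchronization controller."""
    def __init__(self, num_tiles):
        self.num_tiles = num_tiles
        self.requests = set()

    def sync_request(self, tile_id):
        """A tile's barrier instruction sends a sync request. Returns True
        (standing in for the sync-acknowledge) once all tiles have requested."""
        self.requests.add(tile_id)
        if len(self.requests) == self.num_tiles:
            self.requests.clear()   # barrier complete, ready for the next one
            return True             # sync-ack: release the next phase
        return False                # tile is held at the barrier

ctrl = SyncController(num_tiles=3)
acks = [ctrl.sync_request(0), ctrl.sync_request(1), ctrl.sync_request(2)]
# only the last arriving request completes the barrier: [False, False, True]
```

Each tile thus blocks after its barrier instruction until every tile has reached the same point, which is exactly the property that separates BSP compute phases from exchange phases.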
In embodiments, the exchange phase may be arranged to be performed by the supervisor thread.
In embodiments, the processor may be programmed to perform an artificial intelligence algorithm in which each node in a graph has one or more respective input edges and one or more respective output edges, with the input edges of at least some of the nodes being the output edges of at least some others of the nodes, each node comprising a respective function relating its output edges to its input edges, each respective function being parameterized by one or more respective parameters, and each of the respective parameters having an associated error, such that the graph converges toward a solution as the errors in some or all of the parameters reduce; wherein each of the worker threads may model a respective one of the nodes in the graph, and each of the individual exit states may be used to indicate whether the errors in said one or more parameters of the respective node have satisfied a predetermined condition.
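As an illustration of how a worker's exit state might encode such a per-node convergence condition, consider the following sketch. The threshold value and the error measure are purely hypothetical; the patent only requires that some predetermined condition on the parameter errors be mapped to the 1-bit exit state:

```python
ERROR_THRESHOLD = 0.01  # hypothetical convergence criterion

def node_exit_state(param_errors):
    """Worker modelling one graph node: exit with state 1 if and only if the
    errors in all of the node's parameters are below the threshold."""
    return 1 if all(abs(e) < ERROR_THRESHOLD for e in param_errors) else 0

# The per-node exit states, AND-reduced in hardware, tell the supervisor
# whether the whole (sub)graph has converged:
node_errors = [[0.001, 0.004], [0.002], [0.2, 0.003]]  # last node not converged
converged = all(node_exit_state(errs) == 1 for errs in node_errors)
```

The `all(...)` reduction in the last line is what the aggregated exit state register computes for free: the supervisor makes its branch decision (keep iterating, or move on) from one register read rather than from inspecting every node.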
According to another aspect disclosed herein, there is provided a method of operating a processor comprising an execution pipeline and multiple sets of context registers, the execution pipeline comprising an execution unit for executing machine code instructions, each being an instance of a predefined set of instruction types in an instruction set of the processor, each instruction type in the instruction set being defined by a corresponding operation code and zero or more operand fields for taking zero or more operands; wherein the method comprises:
scheduling the execution pipeline to provide a repeating sequence of temporally interleaved time slots, thereby enabling at least one respective worker thread to be allocated for execution in each respective slot of some or all of the time slots, wherein the program state of the respective worker thread executing in each slot is maintained in a respective one of the sets of context registers; and at least temporarily holding an aggregated exit state of the worker threads in an exit state register of the processor;
wherein the instruction set includes an exit instruction included in each of the worker threads, the exit instruction taking at least an individual exit state of the respective worker thread as an operand; and the method comprises, in response to the operation code of the exit instruction being executed, triggering dedicated hardware logic of the processor to terminate the execution of the respective worker thread in its respective time slot, and to cause the individual exit state specified in the operand to contribute to the aggregated exit state in the exit state register.
According to another aspect disclosed herein, there is provided a computer program product embodied on computer-readable storage and comprising code arranged to run on the processor of any of the embodiments disclosed herein, the code comprising the worker threads, including the exit instruction in each worker thread.
Brief description of the drawings To facilitate understanding of the present description and to show how embodiments can be implemented, reference is made by way of example to the attached drawings in which:
[Fig. 1] Figure 1 is a block diagram of a multi-threaded processing unit;
[Fig. 2] Figure 2 is a block diagram of a plurality of thread contexts;
[Fig. 3] Figure 3 illustrates a scheme of interleaved execution time slots;
[Fig. 4] Figure 4 illustrates a supervisor thread and a plurality of worker threads;
[Fig. 5] Figure 5 is a block diagram of logic for aggregating exit states of multiple threads;
[Fig. 6] Figure 6 schematically illustrates synchronization between worker threads on the same tile;
[Fig. 7] Figure 7 is a block diagram of a processor chip comprising multiple tiles;
[0039] [Fig. 8] Figure 8 is a schematic illustration of a bulk synchronous parallel (BSP) computing model;
[0040] [Fig. 9] Figure 9 is another schematic drawing of a BSP model;
[0041] [Fig. 10] Figure 10 is a schematic drawing of BSP between multi-threaded processing units;
[Fig. 11] Figure 11 is a block diagram of an interconnection system;
[0043] [Fig. 12] Figure 12 is a schematic illustration of a system of multiple interconnected processor chips;
[Fig. 13] Figure 13 is a schematic illustration of a multilevel BSP scheme;
[Fig. 14] Figure 14 is another schematic illustration of a system of multiple processor chips;
[Fig. 15] Figure 15 is a schematic illustration of a graph used in an artificial intelligence algorithm; and [0047] [Fig. 16] Figure 16 illustrates a wiring example for synchronization between chips.
Detailed description of preferred embodiments
The following describes a processor architecture which includes, in its instruction set, a dedicated instruction for terminating the thread in which the instruction is executed and, at the same time, having the state of that thread upon termination contribute to an aggregated exit state for multiple threads running through the same pipeline, e.g. on the same tile. In embodiments, a global aggregate exit state register is also found on each of multiple tiles, holding the same result aggregated across the tiles. First, however, an example of a processor in which this may be incorporated is described with reference to Figures 1 to 4.
Figure 1 illustrates an example of a processor module 4 in accordance with embodiments of the present disclosure. For instance, the processor module 4 may be one tile of an array of like processor tiles on the same chip, or may be implemented as a stand-alone processor on its own chip. The processor module 4 comprises a multi-threaded processing unit 10 in the form of a barrel-threaded processing unit, and a local memory 11 (i.e. on the same tile in the case of a multi-tile array, or the same chip in the case of a single-processor chip). A barrel-threaded processing unit is a type of multi-threaded processing unit in which the execution time of the pipeline is divided into a repeating sequence of interleaved time slots, each of which can be owned by a given thread. This will be described in more detail shortly.
The memory 11 includes an instruction memory 12 and a data memory 22 (which can be implemented in various different addressable memory modules or in different regions on the same addressable memory module). The instruction memory 12 stores machine code to be executed by the processing unit 10, while the data memory 22 stores both data on which the executed code will operate and output data produced by the executed code (for example a result of such operations).
The memory 12 stores a variety of different threads of a program, each thread comprising a respective sequence of instructions for performing a certain task or tasks. Note that an instruction as referred to herein means a machine code instruction, i.e. an instance of one of the fundamental instructions of the processor's instruction set, consisting of a single operation code and zero or more operands.
The program described herein comprises a plurality of worker threads, and a supervisor subprogram which may be structured as one or more supervisor threads. This will be described in more detail shortly. In embodiments, each of some or all of the worker threads takes the form of a respective codelet. A codelet is a particular type of thread, sometimes also referred to as an atomic thread. It has all the input information it needs to execute from the beginning of the thread (from the time of being launched), i.e. it does not take any input from any other part of the program or from memory after being launched. Further, no other part of the program will use any outputs (results) of the thread until it has terminated (finished). Unless it encounters an error, it is guaranteed to finish. N.B. some literature also defines a codelet as being stateless, i.e. if run twice it could not inherit any information from its first run, but that additional definition is not adopted here. Note also that not all of the worker threads need be codelets (atomic), and in embodiments some or all of the workers may instead be able to communicate with one another.
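The codelet property described above, all inputs bound at launch and no externally visible effects until termination, is analogous to a pure function over its launch-time inputs. A rough software analogy (the function and its computation are illustrative, not from the patent):

```python
def codelet(inputs):
    """Software analogy of a codelet: everything it needs is supplied at
    launch, it touches no shared state while running, and its result only
    becomes visible to the rest of the program once it returns (terminates)."""
    total = 0
    for x in inputs:
        total += x * x   # operates only on launch-time inputs
    return total         # output consumed only after termination

result = codelet([1, 2, 3])
```

This launch-time binding is what makes codelets freely schedulable in any free slot: the scheduler never has to worry about a running codelet observing or perturbing another thread mid-flight.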
Within the processing unit 10, multiple different ones of the threads from the instruction memory 12 can be interleaved through a single execution pipeline 13 (though typically only a subset of all the threads stored in the instruction memory can be interleaved at any given point in the overall program). The multi-threaded processing unit 10 comprises: a plurality of context register banks 26, each arranged to represent the state (context) of a different respective one of the threads to be executed concurrently; a shared execution pipeline 13 that is common to the concurrently executed threads; and a scheduler 24 for scheduling the concurrent threads for execution through the shared pipeline in an interleaved manner, preferably in a round-robin manner. The processing unit 10 is connected to a shared instruction memory 12 common to the plurality of threads, and a shared data memory 22 that is again common to the plurality of threads.
The execution pipeline 13 comprises a fetch stage 14, a decode stage 16, and an execution stage 18 comprising an execution unit which may perform arithmetic and logical operations, address calculations, load and store operations, and other operations, as defined by the instruction set architecture. Each of the context register banks 26 comprises a respective set of registers for representing the program state of a respective thread.
[0054] An example of the registers making up each of the context register banks 26 is illustrated in Figure 2.
Each of the context register banks 26 comprises one or more respective control registers 28, comprising at least a program counter (PC) for the respective thread (to keep track of the instruction address at which the thread is currently executing), and in embodiments also a set of one or more status registers (SR) recording a current status of the respective thread (such as whether it is running or paused, e.g. because it has encountered an error). Each of the context register banks 26 also comprises a respective set of operand registers (OP) 32, for temporarily holding operands of the instructions executed by the respective thread, i.e. values operated upon or resulting from operations defined by the operation codes of the respective thread's instructions when executed. It will be appreciated that each of the context register banks 26 may optionally comprise one or more other types of respective register (not shown). It will also be appreciated that whilst the term register bank is sometimes used to refer to a group of registers in a common address space, this does not necessarily have to be the case in the present disclosure, and each of the hardware contexts 26 (each of the register sets 26 representing each context) may more generally comprise one or more such register banks.
As will be discussed in more detail below, the arrangement described has one worker context register bank CX0...CX(M-1) for each of the M threads that can be executed concurrently (M = 3 in the example illustrated, but this is not limiting), and one additional supervisor context register bank CXS. The worker context register banks are reserved for storing the contexts of worker threads, and the supervisor context register bank is reserved for storing the context of a supervisor thread. Note that in embodiments the supervisor context is special, in that it comprises a different number of registers than each of the workers. Each of the worker contexts preferably has the same number of status registers and operand registers as one another. In embodiments the supervisor context may have fewer operand registers than each of the workers. Examples of operand registers that the worker context may have and the supervisor not include: floating point registers, accumulate registers, and/or dedicated weight registers (for holding the weights of a neural network). In embodiments the supervisor may also have a different number of status registers. Further, in embodiments the instruction set architecture of the processor module 4 may be configured such that the worker threads and supervisor thread(s) execute some different types of instruction but also share some instruction types.
The fetch stage 14 is connected so as to fetch instructions to be executed from the instruction memory 12, under control of the scheduler 24. The scheduler 24 is arranged to control the fetch stage 14 to fetch an instruction from each of a set of concurrently executing threads in turn in a repeating sequence of time slots, thus dividing the resources of the pipeline 13 into a plurality of temporally interleaved time slots, as will be discussed in more detail shortly. For example, the scheduling scheme could be round-robin or weighted round-robin. Another term for a processor operating in such a manner is a barrel-threaded processor.
In some embodiments, the scheduler 24 may have access to one of the status registers SR of each thread indicating whether the thread is paused, so that the scheduler 24 in fact controls the fetch stage 14 to fetch the instructions of only those threads that are currently active. In embodiments, preferably each time slot (and corresponding context register bank) is always owned by one thread or another, i.e. each slot is always occupied by some thread, and each slot is always included in the sequence of the scheduler 24; though the thread occupying any given slot may happen to be paused at the time, in which case, when the sequence comes around to that slot, the instruction fetch for the respective thread is passed over. Alternatively it is not excluded, for example, that in less preferred alternative implementations some slots can be temporarily vacant and excluded from the scheduled sequence. Where reference is made to the number of time slots the execution unit is operable to interleave, or suchlike, this means the maximum number of slots the execution unit is capable of executing concurrently, i.e. the number of concurrent slots the execution unit's hardware supports.
The fetch stage 14 has access to the program counter (PC) of each of the contexts. For each respective thread, the fetch stage 14 fetches the next instruction of that thread from the next address in the program memory 12 as indicated by the program counter. The program counter increments each execution cycle unless branched by a branch instruction. The fetch stage 14 then passes the fetched instruction to the decode stage 16 to be decoded, and the decode stage 16 then passes an indication of the decoded instruction to the execution unit 18 along with the decoded addresses of any operand registers 32 specified in the instruction, in order for the instruction to be executed. The execution unit 18 has access to the operand registers 32 and the control registers 28, which it may use in executing the instruction based on the decoded register addresses, such as in the case of an arithmetic instruction (e.g. by adding, multiplying, subtracting or dividing the values in two operand registers and outputting the result to another operand register of the respective thread). Or if the instruction defines a memory access (load or store), the load/store logic of the execution unit 18 loads a value from the data memory into an operand register of the respective thread, or stores a value from an operand register of the respective thread into the data memory 22, in accordance with the instruction. Or if the instruction defines a branch or a status change, the execution unit changes the value in the program counter PC or one of the status registers SR accordingly. Note that while one thread's instruction is being executed by the execution unit 18, an instruction from the thread in the next time slot in the interleaved sequence can be being decoded by the decode stage 16; and/or while one instruction is being decoded by the decode stage 16, the instruction from the thread in the next time slot after that can be being fetched by the fetch stage 14 (though in general the scope of the disclosure is not limited to one instruction per time slot, e.g. in alternative scenarios a batch of two or more instructions could be issued from a given thread per time slot). Thus the interleaving advantageously hides latency in the pipeline 13, in accordance with known barrel-threading techniques.
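The overlap described above, one slot's instruction executing while the next slot's is decoding and the one after is fetching, can be shown with a toy trace of a three-stage pipeline. This is a timing illustration only, under the simplifying assumptions of one instruction per slot and strict round-robin; it is not a model of the actual hardware:

```python
def barrel_pipeline_trace(threads, cycles):
    """Trace which thread occupies each pipeline stage per cycle, with slots
    assigned round-robin. At steady state, fetch/decode/execute hold the
    instructions of three consecutive slots, hiding per-thread latency."""
    trace = []
    for cycle in range(cycles):
        fetch   = threads[cycle % len(threads)]
        decode  = threads[(cycle - 1) % len(threads)] if cycle >= 1 else None
        execute = threads[(cycle - 2) % len(threads)] if cycle >= 2 else None
        trace.append((fetch, decode, execute))
    return trace

trace = barrel_pipeline_trace(["T0", "T1", "T2", "T3"], cycles=5)
# at cycle 2: T2 is fetching while T1 decodes and T0 executes
```

Because consecutive pipeline stages always hold instructions from different threads, no thread ever waits on its own previous instruction in the pipeline, which is the latency-hiding benefit the text describes.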
An example of the interleaving scheme implemented by the scheduler 24 is illustrated in Figure 3. Here the concurrent threads are interleaved according to a round-robin scheme whereby, within each round of the scheme, the round is divided into a sequence of time slots S0, S1, S2..., each for executing a respective thread. Typically each slot is one processor cycle long and the different slots are equally sized, though this is not necessary in all possible embodiments; e.g. a weighted round-robin scheme is also possible whereby some threads get more cycles than others per round. In general the barrel-threading may employ either an even round-robin or a weighted round-robin scheme, where in the latter case the weighting may be fixed or adaptive.
Whatever the sequence per execution round, this pattern then repeats, each round comprising a respective instance of each of the time slots. Note therefore that a time slot as referred to herein means the repeating allocated place in the sequence, not a particular instance of the slot in a given repetition of the sequence. Put another way, the scheduler 24 apportions the execution cycles of the pipeline 13 into a plurality of temporally interleaved (time-division multiplexed) execution channels, each comprising a recurrence of a respective time slot in a repeating sequence of time slots. In the illustrated embodiment there are four time slots, but this is just for illustrative purposes and other numbers are possible. E.g. in one preferred embodiment there are in fact six time slots.
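To make the time-division multiplexing concrete, the round-robin slot rotation described above can be modelled by the following illustrative Python sketch; the function name, the four-slot count and the thread labels are assumptions of the sketch rather than features of the embodiment:

```python
# Illustrative sketch of the round-robin interleaving of Figure 3:
# each processor cycle issues one instruction from the thread occupying
# the current slot, then advances to the next slot in the sequence.

def issue_order(threads, cycles):
    """Return which thread is issued on each of `cycles` consecutive
    cycles, assuming one instruction per slot and equally sized slots."""
    num_slots = len(threads)
    return [threads[c % num_slots] for c in range(cycles)]

# Four slots S0...S3 running threads W0...W3, as in the illustrated example.
order = issue_order(["W0", "W1", "W2", "W3"], 8)
```

With an even round-robin and one instruction per slot, each thread is issued exactly once per round, which is what hides the pipeline latency between consecutive instructions of the same thread.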
Whatever the number of time slots the round-robin scheme is divided into, then in accordance with the present description, the processing unit 10 comprises one more context register bank 26 than there are time slots, i.e. it supports one more context than the number of interleaved time slots it is capable of barrel-threading.

This is illustrated by way of example in the figures: if there are four time slots S0...S3 as shown in Figure 3, then there are five context register banks, labelled here CX0, CX1, CX2, CX3 and CXS. That is, even though there are only four execution time slots S0...S3 in the barrel-threaded scheme and so only four threads can be executed concurrently, it is disclosed herein to add a fifth context register bank CXS, comprising a fifth program counter (PC), a fifth set of operand registers 32, and in embodiments also a fifth set of one or more status registers (SR). Note however that, as mentioned, in embodiments the supervisor context may differ from the others CX0...3, and the supervisor thread may support a different set of instructions for operating the execution pipeline 13.
Each of the first four contexts CX0...CX3 is used to represent the state of a respective one of a plurality of worker threads currently assigned to one of the four execution time slots S0...S3, for performing whatever application-specific computation task is desired by the programmer (note also that this may be only a subset of the total number of worker threads of the program as stored in the instruction memory 12). The fifth context CXS however is reserved for a special function, to represent the state of a supervisor thread (SV) whose role it is to coordinate the execution of the worker threads, at least in the sense of assigning which of the worker threads W is to be executed in which of the time slots S0, S1, S2... at which point in the overall program. Optionally the supervisor thread may have other supervisory or coordinating responsibilities. For example, the supervisor thread may be responsible for performing barrier synchronizations to ensure a certain order of execution. E.g. in the case where one or more second threads are dependent on data to be output by one or more first threads run on the same processor module 4, the supervisor may perform a barrier synchronization to ensure that none of the second threads begins until the first threads have finished. And/or the supervisor may perform a barrier synchronization to ensure that one or more threads on the processor module 4 do not begin until a certain external source of data, such as another tile or processor chip, has completed the processing required to make that data available. The supervisor thread may also be used to perform other functionality relating to the multiple worker threads. For example, the supervisor thread may be responsible for communicating data externally to the processor module 4 (to receive external data to be acted upon by one or more of the threads, and/or to transmit data output by one or more of the worker threads). In general the supervisor thread may be used to provide any kind of overseeing or coordinating function desired by the programmer. For instance, as another example, the supervisor may oversee transfers between the tile's local memory 12 and one or more resources in the wider system (external to the array 6) such as a storage disk or a network card.
It will of course be appreciated that four time slots is just an example, and in general in other embodiments there may be other numbers, such that if there is a maximum of M time slots 0...M-1 per round, the processor module 4 comprises M+1 contexts CX0...CX(M-1) and CXS, i.e. one for each worker thread that can be interleaved at any given time and an extra context for the supervisor. E.g. in one exemplary implementation there are six time slots and seven contexts.
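The relationship between slots and contexts described above can be sketched as follows; the register-bank fields (PC, SR, 32 operand registers) follow the description, while the dictionary representation is purely an assumption of this model:

```python
# Sketch of the M+1 context register banks: one bank per interleaved
# time slot for worker threads, plus one extra bank (CXS) reserved for
# the supervisor thread.

M = 4  # number of interleaved time slots in this sketch

def make_context():
    """One context register bank: program counter, status register,
    and a set of 32 operand registers (all zeroed here)."""
    return {"PC": 0, "SR": 0, "operands": [0] * 32}

worker_contexts = {f"CX{i}": make_context() for i in range(M)}
supervisor_context = {"CXS": make_context()}

# One more context than time slots, as described above.
total_contexts = len(worker_contexts) + len(supervisor_context)
```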
Referring to Figure 4, the supervisor thread SV does not have its own time slot per se in the scheme of interleaved time slots. Nor do the worker threads, as the allocation of slots to worker threads is flexibly defined. Rather, each time slot has its own dedicated context register bank (CX0...CXM-1) for storing worker context, which is used by the worker when the slot is allocated to the worker, but not used when the slot is allocated to the supervisor. When a given slot is allocated to the supervisor, that slot instead uses the context register bank CXS of the supervisor. Note that the supervisor always has access to its own context and no worker is able to occupy the supervisor context register bank CXS.
The supervisor thread SV has the ability to run in any and all of the time slots S0...S3 (or more generally S0...SM-1). The scheduler 24 is configured so that, when the program as a whole starts, it begins by allocating the supervisor thread to all of the time slots, i.e. the supervisor SV starts out running in all of S0...S3. However, the supervisor thread is provided with a mechanism for, at some subsequent point (either straight away or after performing one or more supervisor tasks), temporarily relinquishing each of the slots in which it is running to a respective one of the worker threads, e.g. initially the workers W0...W3 in the example shown in Figure 4. This is achieved by the supervisor thread executing a relinquish instruction, called RUN by way of example herein. In embodiments this instruction takes two operands: an address of a worker thread in the instruction memory 12 and an address of some data for that worker thread in the data memory 22:
RUN task_addr, data_addr
The worker threads are portions of code that can be run concurrently with one another, each representing one or more respective computation tasks to be performed. The data address may specify some data to be acted upon by the worker thread. Alternatively, the relinquish instruction may take only a single operand specifying the address of the worker thread, and the data address could be included in the code of the worker thread; or in another example the single operand could point to a data structure specifying the addresses of the worker thread and the data. As mentioned, in embodiments at least some of the worker threads may take the form of codelets, i.e. atomic units of concurrently executable code. Alternatively or additionally, some of the worker threads need not be codelets and may instead be able to communicate with one another.
The relinquish instruction (RUN) acts on the scheduler 24 so as to relinquish the current time slot, in which this instruction is itself executed, to the worker thread specified by the operand. Note that it is implicit in the relinquish instruction that it is the time slot in which this instruction is executed that is being relinquished (implicit in the context of machine code instructions means it does not need an operand to specify this: it is understood implicitly from the opcode itself). Thus the time slot which is given away is the time slot in which the supervisor executes the relinquish instruction. Or put another way, the supervisor is executing in the same space that it gives away. The supervisor says "run this piece of code at this location", and then from that point onwards the recurring slot is owned (temporarily) by the relevant worker thread.
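The implicit-operand semantics of the relinquish (RUN) instruction can be sketched behaviourally as follows; the slot table and worker label are illustrative assumptions, and the point of the model is that RUN takes no slot operand because the slot being relinquished is the one in which the instruction executes:

```python
# Sketch of the implicit-slot semantics of RUN: the relinquished slot is
# the slot in which the supervisor executes the instruction, so no slot
# operand is needed.

# At program start the supervisor SV owns all of the time slots S0...S3.
slots = {s: "SV" for s in ("S0", "S1", "S2", "S3")}

def run(current_slot, task_addr):
    """Model of RUN, executed by the thread occupying `current_slot`:
    hand that slot to the worker whose code starts at `task_addr`."""
    assert slots[current_slot] == "SV", "only the supervisor relinquishes slots"
    slots[current_slot] = task_addr

# The supervisor, running in S1, relinquishes S1 to a worker.
run("S1", "worker_W1")
```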
The supervisor thread SV performs a similar operation in each of one or more others of the time slots, to give away some or all of its time slots to different respective ones of the worker threads W0...W3 (selected from a larger set W0...Wj in the instruction memory 12). Once it has done so for the last slot, the supervisor is suspended (it will later resume where it left off when one of the slots is handed back by a worker thread W).
The supervisor thread SV is thus able to allocate different worker threads, each performing one or more tasks, to different ones of the interleaved execution time slots S0...S3. When the supervisor thread determines it is time to run a worker thread, it uses the RUN instruction to allocate that worker to the time slot in which the RUN instruction was executed.
In some embodiments, the instruction set also comprises a variant of the run instruction, RUNALL ("run all"). This instruction is used to launch a set of more than one worker thread together, all executing the same code. In embodiments this launches a worker in every one of the processing unit's slots S0...S3 (or more generally S0...S(M-1)).
Further, in some embodiments, the RUN and/or RUNALL instruction, when executed, also automatically copies some status from one or more of the supervisor status registers (SR) of CXS into a corresponding one or more status registers of the worker thread(s) launched by the RUN or RUNALL. For instance the copied status may comprise one or more modes, such as a floating point rounding mode (e.g. round to nearest or round to zero) and/or an overflow mode (e.g. saturate or use a value representing infinity). The copied status or mode then controls the worker in question to operate in accordance with the copied status or mode. In embodiments, the worker can later overwrite this in its own status register (but cannot change the supervisor's status). In further alternative or additional embodiments, the workers may choose to read some status from one or more status registers of the supervisor (and again may change their own status later). E.g. again this could be to adopt a mode from the supervisor status register, such as a floating point mode or a rounding mode. In embodiments however, the supervisor cannot read any of the context registers CX0... of the workers.
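The automatic status copy on launch can be sketched as follows; the particular register names (fp_rounding, overflow) are illustrative assumptions, chosen to mirror the rounding and overflow modes mentioned above:

```python
# Sketch of the automatic status copy on RUN/RUNALL: selected supervisor
# status-register state is copied into the launched worker's own status
# register, which the worker may later overwrite without affecting the
# supervisor's copy.

supervisor_sr = {"fp_rounding": "nearest", "overflow": "saturate"}

def launch_worker():
    """Model of launching a worker: copy the supervisor modes at launch,
    then let the worker overwrite its own copy."""
    worker_sr = dict(supervisor_sr)    # copy at launch time
    worker_sr["fp_rounding"] = "zero"  # worker overwrites its own status
    return worker_sr

w = launch_worker()
```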
Once launched, each of the currently allocated worker threads W0...W3 proceeds to perform the one or more computation tasks defined in the code specified by the respective relinquish instruction. At the end of this, the respective worker thread then hands the time slot in which it is running back to the supervisor thread. This is achieved by executing an exit instruction (EXIT).
The EXIT instruction takes at least one operand and preferably only a single operand, exit_state (e.g. a binary value), to be used for any purpose desired by the programmer to indicate a state of the respective codelet upon ending (e.g. to indicate whether a certain condition was met):

EXIT exit_state

The EXIT instruction acts on the scheduler 24 so that the time slot in which it is executed is returned back to the supervisor thread. The supervisor thread can then perform one or more subsequent supervisor tasks (e.g. barrier synchronization and/or exchange of data with external resources such as other tiles), and/or continue to execute another relinquish instruction to allocate a new worker thread (W4, etc.) to the slot in question. Note again therefore that the total number of threads in the instruction memory 12 may be greater than the number that the barrel-threaded processing unit 10 can interleave at any one time. It is the role of the supervisor thread SV to schedule which of the worker threads W0...Wj from the instruction memory 12, at which stage in the overall program, are to be assigned to which of the interleaved time slots S0...SM-1 in the round-robin schedule of the scheduler 24.
Furthermore, the EXIT instruction is given an additional special function, namely to cause the exit state specified in the operand of the EXIT instruction to be automatically aggregated (by dedicated hardware logic) with the exit states of a plurality of other worker threads being run through the same pipeline 13 of the same processor module 4 (e.g. the same tile). Thus an extra, implicit facility is included in the instruction for terminating a worker thread.
An example circuit for achieving this is shown in Figure 5. In this example, the exit states of the individual threads and the aggregated exit state each take the form of a single bit, i.e. 0 or 1. The processor module 4 comprises a register 38 for storing the aggregated exit state of that processor module 4. This register may be referred to herein as the "local consensus" register $LC (as opposed to a global consensus when the processor module 4 is one of an array of similar processor tiles, to be discussed in more detail shortly). In embodiments this local consensus register $LC 38 is one of the supervisor's status registers in the supervisor's context register bank CXS. The logic for performing the aggregation comprises an AND gate 37 arranged to perform a logical AND of (A) the exit state specified in the operand of the EXIT instruction and (B) the current value in the local consensus register ($LC) 38, and to output the result (Q) back into the local consensus register $LC 38 as the new value of the local aggregate.

At a suitable synchronization point in the program, the value stored in the local consensus register ($LC) 38 is initially reset to a value of 1. That is, any threads exiting after this point will contribute to the locally aggregated exit state $LC until the next reset. The output (Q) of the AND gate 37 is 1 if both inputs (A, B) are 1, but otherwise the output Q goes to 0 if either of the inputs (A, B) is 0. Every time an EXIT instruction is executed, its exit state is aggregated with those that have gone before (since the last reset). Thus by means of the arrangement shown in Figure 5, the logic keeps a running aggregate of the exit states of any worker threads which have terminated by means of an EXIT instruction since the last time the local consensus register ($LC) 38 was reset. In this example the running aggregate indicates whether or not all threads so far have exited true: any exit state of 0 from any of the worker threads will cause the aggregate in the register 38 to become latched at 0 until the next reset. In embodiments the supervisor SV can read the running aggregate at any time by getting the current value from the local consensus register ($LC) 38 (it does not need to wait for an on-tile synchronization to do so).
The resetting of the aggregate in the local consensus register ($LC) 38 may be performed by the supervisor SV performing a PUT to the register address of the local consensus register ($LC) 38 using one or more general purpose instructions, in this example to put a value of 1 into the register 38. Alternatively it is not excluded that the reset could be performed by an automated mechanism, for example triggered by executing the SYNC instruction described later herein.
The aggregation circuitry 37, in this case the AND gate, is implemented in dedicated hardware circuitry in the execution unit of the execution stage 18, using any suitable combination of electronic components for forming the functionality of a Boolean AND. Dedicated circuitry or hardware means circuitry having a hard-wired function, as opposed to being programmed in software using general purpose code. The updating of the local exit state is triggered by the execution of the special EXIT instruction, this being one of the fundamental machine code instructions in the instruction set of the processor module 4, having the inherent functionality of aggregating the exit states. Also, the local aggregate is stored in a control register 38, i.e. a dedicated piece of storage (in some embodiments a single bit of storage) whose value can be accessed by the code running on the pipeline, but which is not usable by the load-store unit (LSU) to store any general purpose data. Instead, the function of the data held in a control register is fixed, in this case to the function of storing the locally aggregated exit state. Preferably the local consensus register ($LC) 38 forms one of the control registers on the processor module 4 (e.g. on the tile), whose value the supervisor can access by executing a GET instruction and which can be set by executing a PUT instruction.
Note that the circuit shown in Figure 5 is just one example. An equivalent circuit would be to replace the AND gate 37 with an OR gate and to invert the interpretation of the exit states 0 and 1 in software, i.e. 0 for true and 1 for false (with the register 38 being reset to 0 rather than 1 at each synchronization point). Equivalently, if the AND gate is replaced with an OR gate but the interpretation of the exit states is not inverted, nor the reset value, then the aggregated state in $LC will record whether any (rather than all) of the worker threads exited with state 1. In other embodiments, the exit states need not be single bits. E.g. the exit state of each individual worker may be a single bit, but the aggregated exit state $LC may comprise two bits representing a trinary state: all workers exited with state 1, all workers exited with state 0, or the workers' exit states were mixed. As an example of the logic for implementing this, one of the two bits encoding the trinary value may be a Boolean AND (or OR) of the individual exit states, and the other bit of the trinary value may be a Boolean OR of the individual exit states. The third encoded case, indicating that the workers' exit states were mixed, can then be formed as the XOR of these two bits.
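A minimal sketch of the two-bit trinary encoding just described, using a Boolean AND for one bit, a Boolean OR for the other, and their XOR to detect the mixed case:

```python
# Sketch of the two-bit trinary aggregate: one bit is the Boolean AND of
# the individual exit states, the other is their Boolean OR, and the
# "mixed" case is the XOR of those two bits.

def trinary_aggregate(exit_states):
    all_ones = int(all(exit_states))  # AND of the individual exit states
    any_one = int(any(exit_states))   # OR of the individual exit states
    mixed = all_ones ^ any_one        # 1 only when the states were mixed
    return all_ones, any_one, mixed

assert trinary_aggregate([1, 1, 1]) == (1, 1, 0)  # all exited with state 1
assert trinary_aggregate([0, 0, 0]) == (0, 0, 0)  # all exited with state 0
assert trinary_aggregate([1, 0, 1]) == (0, 1, 1)  # mixed exit states
```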
The exit states can be used to represent whatever the programmer wishes, but one particularly envisaged example is to use an exit state of 1 to indicate that the respective worker thread exited in a "successful" or "true" state, whilst an exit state of 0 indicates that the respective worker thread exited in an "unsuccessful" or "false" state (or vice versa if the aggregation circuitry 37 performs an OR instead of an AND and the register $LC 38 is initially reset to 0). For instance, consider an application where each worker thread performs a computation having an associated condition, such as a condition indicating whether the error or errors in the one or more parameters of a respective node in the graph of an artificial intelligence algorithm have fallen within an acceptable level according to a predetermined metric. In this case, an individual exit state of one logical level (e.g. 1) may be used to indicate that the condition is satisfied (e.g. the error or errors in the one or more parameters of the node are within an acceptable level according to some metric); whilst an individual exit state of the opposite logical level (e.g. 0) may be used to indicate that the condition was not satisfied (e.g. the error or errors are not within an acceptable level according to the metric in question). The condition may for example be an error threshold placed on a single parameter or on each parameter, or could be a more complex function of a plurality of parameters associated with the respective computation performed by the worker thread.
As another more complex example, the individual exit states of the workers and the aggregated exit state may each comprise two or more bits, which may be used, for example, to represent a degree of confidence in the results of the worker threads. E.g. the exit state of each individual worker thread may represent a probabilistic measure of confidence in a result of the respective worker thread, and the aggregation logic 37 may be replaced with more complex circuitry for performing a probabilistic aggregation of the individual confidence levels in hardware.
Whatever meaning is given by the programmer to the exit states, the supervisor thread SV can then get the aggregated value from the local consensus register ($LC) 38 to determine the aggregated exit state of all the worker threads that exited since it was last reset, for example at the last synchronization point, e.g. to determine whether or not all the workers exited in a successful or true state. In dependence on this aggregated value, the supervisor thread may then make a decision in accordance with the programmer's design choice. The programmer can choose to make whatever use of the locally aggregated exit state he or she wishes. For example, the supervisor thread may consult the locally aggregated exit state in order to determine whether a certain portion of the program made up of a certain subset of worker threads has completed as expected or desired. If not (e.g. at least one of the worker threads exited in an unsuccessful or false state), it may report to a host processor, or may perform another iteration of the part of the program comprising the same worker threads; but if so (e.g. all the worker threads exited in a successful or true state) it may instead branch to another part of the program comprising one or more new worker threads.
Preferably, the supervisor thread should not access the value in the local consensus register ($LC) 38 until all the worker threads in question have exited, such that the value stored therein represents the correct, up-to-date aggregate state of all the desired threads. This wait may be enforced by a barrier synchronization performed by the supervisor thread to wait for all currently running local worker threads (i.e. those on the same processor module 4, running through the same pipeline 13) to exit. That is, the supervisor thread resets the local consensus register ($LC) 38, launches a plurality of worker threads, and then initiates a local barrier synchronization (local to the processing module 4, local to one tile) in order to wait for all the outstanding worker threads to exit before the supervisor is allowed to proceed to get the aggregated exit state from the local consensus register ($LC) 38.
Referring to Figure 6, in embodiments a SYNC (synchronization) instruction is provided in the processor's instruction set. The effect of the SYNC instruction is to cause the supervisor thread SV to wait until all currently executing workers W have exited by means of an EXIT instruction. In embodiments the SYNC instruction takes a mode as an operand (in embodiments its only operand), the mode specifying whether the SYNC is to act only locally in relation to only those worker threads running locally on the same processor module 4, e.g. the same tile, as the supervisor as part of which the SYNC is executed (i.e. only threads through the same pipeline 13 of the same barrel-threaded processing unit 10); or whether instead it is to apply across multiple tiles or even across multiple chips.
SYNC mode // mode ∈ {tile, chip, zone_1, zone_2}

This will be discussed in more detail later, but for the purposes of Figure 6 a local SYNC will be assumed ("SYNC tile", i.e. a synchronization within a single tile).
The workers do not need to be identified as operands of the SYNC instruction, as it is implicit that the supervisor SV is then automatically caused to wait until none of the time slots S0, S1,... of the barrel-threaded processing unit 10 is occupied by a worker. As shown in Figure 6, once all the workers of a current batch WLn have been launched by the supervisor, the supervisor then executes a SYNC instruction. If the supervisor SV launches workers W in all the slots S0...3 of the barrel-threaded processing unit 10 (all four in the example illustrated, but that is just one implementation example), then the SYNC will be executed by the supervisor as soon as the first of the current batch of worker threads WLn has exited, thereby handing back control of at least one slot to the supervisor SV. Otherwise, if the workers do not take up all of the slots, the SYNC will simply be executed immediately after the last thread of the current batch WLn has been launched. Either way, the SYNC causes the supervisor to wait for all other workers of the current batch WLn-1 to execute an EXIT before the supervisor can proceed. Only after this does the supervisor execute a GET instruction to get the content of the local consensus register ($LC) 38. This waiting by the supervisor thread is imposed in hardware once the SYNC has been executed. That is, in response to the opcode of the SYNC instruction, the logic in the execution unit (EXU) of the execution stage 18 causes the fetch stage 14 and the scheduler 24 to pause issuing instructions of the supervisor thread until all outstanding worker threads have executed an EXIT instruction. At some point after getting the value of the local consensus register ($LC) 38 (optionally with some other supervisor code in between), the supervisor executes a PUT instruction to reset the local consensus register ($LC) 38 (to 1 in the illustrated example).
As also illustrated in Figure 6, the SYNC instruction may also be used to place synchronization barriers between different interdependent layers WL1, WL2, WL3,... of worker threads, where one or more threads in each successive layer is dependent on data output by one or more worker threads in the preceding layer. The local SYNC executed by the supervisor thread ensures that none of the worker threads in the next layer WLn+1 executes until all the worker threads in the immediately preceding layer WLn have exited (by executing an EXIT instruction).
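The reset/launch/SYNC/GET sequence described above can be sketched per layer as follows; modelling the workers as plain functions that return their exit state is an assumption of the sketch, since the real workers run interleaved in hardware:

```python
# Sketch of the supervisor's per-layer sequence: PUT a 1 into $LC,
# launch a batch of workers, SYNC until every slot has been handed back
# by EXIT, then GET the aggregated exit state from $LC.

local_consensus = 1  # $LC

def run_layer(workers):
    """Run one layer WLn to completion and return the aggregated state."""
    global local_consensus
    local_consensus = 1                # PUT: reset $LC
    for worker in workers:             # RUN each worker in a slot
        local_consensus &= worker()    # EXIT aggregates into $LC
    # SYNC tile: all workers above have exited before $LC is read
    return local_consensus             # GET $LC

# One worker of the layer exits with state 0, so the aggregate is 0.
agg = run_layer([lambda: 1, lambda: 1, lambda: 0])
```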
As mentioned, in embodiments the processor module 4 may be implemented as one of an array of interconnected tiles forming a multi-tile processor, wherein each of the tiles may be configured as described above in relation to Figures 1 to 6.
This is illustrated in Figure 7, which shows a single chip processor 2, i.e. a single die, comprising an array 6 of multiple processor tiles 4 and an on-chip interconnect 34 connecting between the tiles 4. The chip 2 may be implemented alone in its own single-chip integrated circuit package, or as one of multiple dies packaged in the same IC package.
The on-chip interconnect may also be referred to herein as the "exchange fabric" 34 as it enables the tiles 4 to exchange data with one another. Each tile 4 comprises a respective instance of the barrel-threaded processing unit 10 and a memory 11, each arranged as described above in relation to Figures 1 to 6. For instance, by way of illustration, the chip 2 may comprise of the order of a hundred tiles 4, or even over a thousand. For completeness, note also that an "array" as referred to herein does not necessarily imply any particular number of dimensions or physical layout of the tiles 4.
In embodiments each chip 2 also comprises one or more external links 8, enabling the chip 2 to be connected to one or more other, external processors on different chips (e.g. one or more other instances of the same chip 2). These external links 8 may comprise any one or more of: one or more chip-to-host links for connecting the chip 2 to a host processor, and/or one or more chip-to-chip links for connecting together with one or more other instances of the chip 2 on the same IC package or card, or on different cards. In one example arrangement, the chip 2 receives work from a host processor (not shown), which is connected to the chip via one of the chip-to-host links, in the form of input data to be processed by the chip 2. Multiple instances of the chip 2 can be connected together into cards by chip-to-chip links. Thus a host may access a computer which is architected as a single chip processor 2 or as multiple single chip processors 2, possibly arranged on multiple interconnected cards, depending on the workload required for the host application.
The interconnect 34 is configured to enable the different processor tiles 4 in the array 6 to communicate with one another on the chip 2. However, as well as there potentially being dependencies between threads on the same tile 4, there may also be dependencies between the portions of the program running on different tiles 4 in the array 6. A technique is therefore required to prevent a piece of code on one tile 4 running ahead of data upon which it is dependent being made available by another piece of code on another tile 4.
In some embodiments, this is achieved by implementing a bulk synchronous parallel (BSP) exchange scheme, as illustrated schematically in Figures 8 and 9.
According to one version of BSP, each tile 4 performs a compute phase 52 and an exchange phase 50 in an alternating cycle, separated from one another by a barrier synchronization 30 between tiles. In the case illustrated, a barrier synchronization is placed between each compute phase 52 and the following exchange phase 50. During the compute phase 52 each tile 4 performs one or more computation tasks locally on-tile, but does not communicate any results of these computations to any others of the tiles 4. In the exchange phase 50 each tile 4 is allowed to exchange one or more results of the computations from the preceding compute phase with one or more others of the tiles in the group, but does not perform any new computations until it has received from the other tiles 4 any data on which its task or tasks depend. Neither does it send to any other tile any data other than that computed in the preceding compute phase. It is not excluded that other operations, such as internal control related operations, may be performed in the exchange phase. In some embodiments the exchange phase 50 does not include any non-time-deterministic computations, but a small number of time-deterministic computations may optionally be allowed during the exchange phase 50. Note also that a tile 4 performing computation may be allowed during the compute phase 52 to communicate with other system resources external to the array of tiles 4 being synchronized, such as a network card, a disk drive, or a field-programmable gate array (FPGA), as long as this does not involve communication with other tiles 4 within the group being synchronized. The communication external to the tile group may optionally utilize the BSP mechanism, but alternatively may not utilize BSP and may instead use some other synchronization mechanism of its own.
According to the BSP principle, a barrier synchronization 30 is placed at the juncture between the calculation phase 52 and the exchange phase 50, or at the juncture between the exchange phase 50 and the calculation phase 52, or both. This means that either: (a) all the blocks 4 must complete their respective calculation phases 52 before any block in the group is allowed to proceed to the next exchange phase 50, or (b) all the blocks 4 in the group must complete their respective exchange phases 50 before any block in the group is allowed to proceed to the next calculation phase 52, or (c) both of these conditions are imposed. In all three variants it is the individual processors that alternate between the phases, and the whole assembly that synchronizes. The sequence of exchange and calculation phases can then be repeated multiple times. In BSP terminology, each repetition of an exchange phase and a calculation phase is sometimes called a superstep (it should be noted, however, that the terminology is not always used consistently in the literature: sometimes each individual exchange phase and each individual calculation phase is called a superstep, while elsewhere, as in the terminology adopted here, the exchange and calculation phases together are called a superstep).
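The alternation just described can be sketched in software terms. This is a minimal illustrative model, not the hardware implementation; the `Tile` class and `broadcast` function are hypothetical names, and variant (a) of the barrier is modelled simply by finishing every compute call before any exchange begins.

```python
# Minimal sketch of one BSP superstep under variant (a): every block
# finishes its calculation phase 52 before any block begins the
# exchange phase 50. All names here are illustrative.

class Tile:
    def __init__(self, value):
        self.value = value
        self.inbox = []

    def compute(self):
        self.value += 1        # local work only; no communication here
        return self.value

def broadcast(tiles, results):
    # exchange phase 50: only results from the preceding compute phase move
    for tile in tiles:
        tile.inbox = list(results)

def run_superstep(tiles, exchange):
    results = [tile.compute() for tile in tiles]   # calculation phase 52
    # barrier 30: reached only once all compute() calls have returned
    exchange(tiles, results)                       # exchange phase 50

tiles = [Tile(i) for i in range(4)]
run_superstep(tiles, broadcast)
```

Repeating `run_superstep` in a loop models the repeated sequence of supersteps.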
It will also be noted that it is not excluded that multiple independent groups of blocks 4 on the same chip 2 or on different chips may each form a separate respective BSP group operating asynchronously with respect to one another, the BSP cycle of calculation, synchronization and exchange being imposed only within each given group, each group doing so independently of the other groups. That is, a multi-block matrix 6 can include multiple internally synchronous groups, each operating independently and asynchronously with respect to the other such groups (described in more detail below). In some embodiments there is a hierarchical grouping of synchronization and exchange, as will be described in more detail below.
FIG. 9 illustrates the BSP principle as implemented among a group 4i, 4ii, 4iii of some or all of the blocks of the matrix 6, in the case which imposes: (a) barrier synchronization between the calculation phase 52 and the exchange phase 50 (see above). Note that in this arrangement, some blocks 4 are allowed to start calculating 52 while some others are still exchanging.
According to the embodiments described here, this type of BSP can be facilitated by incorporating additional, special, dedicated functionality in a machine code instruction for performing barrier synchronization, namely the SYNC instruction.
In some embodiments, the SYNC instruction takes on this functionality when qualified by an inter-block mode as operand, for example the on-chip mode: SYNC chip.
This is illustrated schematically in FIG. 10. In the case where each block 4 comprises a multi-thread processing unit 10, each calculation phase 52 of a block may in fact include tasks carried out by multiple work threads W on the same block 4 (and a given calculation phase 52 on a given block 4 can comprise one or more layers WL of work threads, which in the case of multiple layers can be separated by internal barrier synchronizations using the SYNC instruction with the local, on-tile mode as operand, as described above). Once the supervisor thread SV on a given block 4 has launched the last work thread of the current BSP superstep, the supervisor on that block 4 then executes a SYNC instruction with the inter-block mode set as operand: SYNC chip. If the supervisor is to launch (RUN) work threads in all the slots of its respective processing unit 10, the SYNC chip is executed as soon as the first slot that is no longer needed to RUN further work threads in the current BSP superstep is handed back to the supervisor. For example, this can happen after the first thread has executed an EXIT in the last layer WL, or simply after the first work thread has executed an EXIT if there is only one layer. Otherwise, if not all the slots are to be used for work threads executing in the current BSP superstep, the SYNC chip can be executed as soon as the last work thread to be RUN in the current BSP superstep has been launched. This can happen after all the work threads in the last layer have been RUN, or simply after all the work threads have been RUN if there is only one layer.
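The timing rule above can be illustrated with a small model: the supervisor issues SYNC chip as soon as the last worker it needs has been launched, without waiting for the workers themselves to exit. This is a hedged sketch; `supervisor_step` and its event tuples are invented for illustration only.

```python
# Illustrative model of when the supervisor may issue SYNC chip:
# immediately after launching (RUN) the last work thread it needs in
# the current BSP superstep. All names here are hypothetical.

def supervisor_step(workers_to_run):
    events = []
    for slot in range(workers_to_run):
        # RUN hands a slot to a worker; the worker EXITs on its own later
        events.append(("RUN", slot))
    # last needed worker launched: the supervisor may now issue SYNC chip
    # without waiting for the workers themselves to EXIT first
    events.append(("SYNC", "chip"))
    return events

ev = supervisor_step(workers_to_run=3)
```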
The execution unit (EXU) of the execution stage 18 is arranged so that, in response to the operation code of the SYNC instruction, when qualified by the on-chip (inter-block) operand, it causes the supervisor thread in which the SYNC chip was executed to be paused until all the blocks 4 of the matrix 6 have finished executing their work threads. This can be used to implement a barrier for the next BSP superstep, i.e. once all the blocks 4 on the chip 2 have passed the barrier, the program spanning the blocks as a whole can progress to the next exchange phase 50.
FIG. 11 is a diagram illustrating the logic triggered by a SYNC chip according to the embodiments described here.
[0104] Once the supervisor has launched (RUN) all the execution threads that it is to launch in the current calculation cycle 52, it executes a SYNC instruction with the inter-block, on-chip operand: SYNC chip. This triggers the following functionality in the dedicated synchronization logic 39 on the block 4, and in a synchronization controller 36 implemented in the hardware interconnection 34. This functionality of both the on-block synchronization logic 39 and the synchronization controller 36 in the interconnection 34 is implemented in dedicated hardware circuitry so that, once the SYNC chip is executed, the rest of the functionality proceeds without further instructions being executed to carry it out.
First, the on-block synchronization logic 39 causes instruction issue for the supervisor on the block 4 in question to pause automatically (it causes the fetch stage 14 and the scheduler 24 to suspend issuing supervisor instructions). Once all the outstanding work threads on the local block 4 have performed an EXIT, the synchronization logic 39 automatically sends a synchronization request sync_req to the synchronization controller 36 in the interconnection 34. The local block 4 then continues to wait with supervisor instruction issue paused. A similar process is also implemented on each of the other blocks 4 in the matrix 6 (each comprising its own instance of the synchronization logic 39). Thus, at a certain point, once all the final work threads in the current calculation phase 52 have performed an EXIT on all the blocks 4 of the matrix 6, the synchronization controller 36 will have received a respective synchronization request (sync_req) from all the blocks 4 of the matrix 6. Only then, in response to receiving the sync_req from every block 4 of the matrix 6 on the same chip 2, does the synchronization controller 36 send a synchronization acknowledgment signal sync_ack back to the synchronization logic 39 on each of the blocks 4. Up to this point, each of the blocks 4 has had its supervisor instruction issue paused awaiting the synchronization acknowledgment signal (sync_ack). Upon receiving the sync_ack signal, the synchronization logic 39 in the block 4 automatically ends the pause in supervisor instruction issue for the respective supervisor thread on that block 4. The supervisor is then free to proceed to exchange data with other blocks 4 via the interconnection 34 in a subsequent exchange phase 50.
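The sync_req / sync_ack handshake just described can be sketched as a simple model: the controller acknowledges only once a request has arrived from every block. The `SyncController` class below is an illustrative software analogue of the hardware controller 36, not its implementation.

```python
# Sketch of the sync_req / sync_ack handshake: the controller in the
# interconnection acks only after a request from every block has arrived.

class SyncController:
    def __init__(self, num_tiles):
        self.pending = set(range(num_tiles))  # blocks yet to send sync_req
        self.acked = False

    def sync_req(self, tile_id):
        self.pending.discard(tile_id)
        if not self.pending:      # last outstanding request received
            self.acked = True     # sync_ack is broadcast to all blocks
        return self.acked

ctrl = SyncController(num_tiles=3)
```

Until `acked` goes true, each block's supervisor remains paused, mirroring the paused instruction issue described above.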
Preferably the sync_req and sync_ack signals are sent to and received from the synchronization controller, respectively, via one or more dedicated synchronization wires connecting each block 4 to the synchronization controller 36 in the interconnection 34.
In addition, according to embodiments described here, additional functionality is included in the SYNC instruction. That is, at least when executed in an inter-block mode (for example SYNC chip), the SYNC instruction also causes the local output states $LC of each of the synchronized blocks 4 to be automatically aggregated in additional dedicated hardware 40 in the interconnection 34. In the embodiments shown, this logic takes the form of a multi-input AND gate (one input for each block 4 of the matrix 6), for example formed from a chain of two-input AND gates 40i, 40ii, ..., as shown by way of example in Figure 11. This inter-block aggregation logic 40 receives the value in the local output state register (local consensus register) $LC 38 from each block 4 of the matrix - in embodiments each being a single bit - and aggregates these values into a single value, for example an AND of all the locally aggregated output states. Thus the logic forms a globally aggregated output state over all the execution threads on all the blocks 4 of the matrix 6.
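The chain of two-input AND gates can be modelled as a reduction over the per-block consensus bits. This is a behavioral sketch of the aggregation logic 40, not its circuit.

```python
# The inter-block aggregation logic 40 reduces one local consensus bit
# ($LC) per block to a single global bit; the chain of two-input AND
# gates 40i, 40ii, ... is modelled here as a left-to-right reduce.

from functools import reduce

def aggregate_output_states(local_states):
    return reduce(lambda a, b: a & b, local_states)
```

Any single 0 input forces the global result to 0, which is exactly the "any failure fails the aggregate" semantics discussed below.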
Each of the blocks 4 comprises a respective instance of a global consensus register ($GC) 42 arranged to receive and store the global output state coming from the global aggregation logic 40 in the interconnection 34. In some embodiments this is another of the state registers in the supervisor's context register bank CXS. In response to the synchronization request (sync_req) being received from all the blocks 4 of the matrix 6, the synchronization controller 36 causes the output of the aggregation logic 40 (for example the output of the AND) to be stored in the global consensus register ($GC) 42 on each block 4 (it will be noted that the switch shown in FIG. 11 is a schematic representation of the functionality and that in practice the update can be implemented by any appropriate digital logic). This register $GC 42 is accessible by the supervisor thread SV on the respective block 4 once supervisor instruction issue is resumed. In some embodiments, the global consensus register $GC is implemented as a control register in the control register bank, so that the supervisor thread can obtain the value in the global consensus register ($GC) 42 using a GET instruction. It will be noted that the synchronization logic 36 waits for the sync_req to be received from all the blocks 4 before updating the value in any of the global consensus registers ($GC) 42, since otherwise an incorrect value could be made available to a supervisor thread on a block which has not yet completed its part of the calculation phase 52 and which is therefore still running.
The globally aggregated output state $GC allows the program to determine an overall outcome of parts of the program executing on multiple different blocks 4 without having to individually examine the state of each individual work thread on each individual block. It can be used for any purpose desired by the programmer. For example, in the example shown in Figure 11 where the global aggregate is a Boolean AND, any input at 0 leads to an aggregate of 0, but if all the inputs are at 1 then the aggregate is 1. That is, if a 1 is used to represent a true or successful outcome, it means that if any of the local output states of any of the blocks 4 is false or failed, then the globally aggregated state will also be false or will represent a failed outcome. For example, this could be used to determine whether the portions of code running on all the blocks have all satisfied a predetermined condition. Thus the program can query a single register (in some embodiments a single bit) to ask "did anything go wrong, yes or no?" or "have all the nodes of the graph reached an acceptable level of error, yes or no?", rather than having to examine the individual states of the individual work threads on each individual block (and again, in embodiments the supervisor is in fact not able to query the state of the work threads except via the output state registers 38, 42). In other words, each of the EXIT and SYNC instructions reduces multiple individual output states to a single combined state.
In an example use case, the supervisor on one or more of the blocks can report to a host processor whether the global aggregate has indicated a false or failed outcome. In another example, the program can make a branch decision based on the global output state. For example, the program examines the globally aggregated output state $GC and on that basis determines whether it should continue looping or should branch elsewhere. As long as the global output state $GC is still false or failed, the program continues iterating the same first part of the program, but once the global output state $GC is true or successful, the program branches to a second, different part of the program. The branch decision can be implemented individually in each supervisor thread, or by one of the supervisors taking the role of master and instructing the other, slave supervisors on the other blocks (the master role being configured in software).
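The loop-until-converged branch decision just described can be sketched as follows. This is a hedged illustration: `read_gc` stands in for reading the $GC register (for example via a GET instruction), and the superstep body is a placeholder.

```python
# Sketch of branching on the globally aggregated output state $GC:
# keep iterating the first part of the program while $GC reads 0 (failed),
# and branch onward once it reads 1 (all blocks succeeded).
# read_gc() is a stand-in for querying the $GC register.

def iterate_until_converged(run_superstep, read_gc, max_steps=100):
    for step in range(max_steps):
        run_superstep()
        if read_gc() == 1:       # every block reported success
            return step + 1      # branch to the second part of the program
    return max_steps

gc_values = iter([0, 0, 1])      # e.g. error drops below threshold on step 3
steps = iterate_until_converged(lambda: None, lambda: next(gc_values))
```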
Note that the aggregation logic 40 shown in Figure 11 is only an example. In another, equivalent example, the AND can be replaced by an OR and the interpretation of 0 and 1 can be inverted (0 -> true, 1 -> false). Equivalently, if the AND gate is replaced by an OR gate but the interpretation of the output states is not inverted, and neither is the reset value, then the aggregated state in $GC will record whether any (rather than all) of the blocks exited with locally aggregated state 1. In another example, the global output state $GC can comprise two bits representing a trinary state: all the locally aggregated output states $LC of the blocks had state 1, all the locally aggregated output states $LC of the blocks had state 0, or the locally aggregated output states $LC of the blocks were mixed. In another, more complex example, the local output states of the blocks 4 and the globally aggregated output state may each comprise two or more bits, which may be used, for example, to represent a degree of confidence in the results of the blocks 4. For example, the locally aggregated output state $LC of each individual block can represent a statistically probabilistic measure of confidence in a result of the respective block 4, and the global aggregation logic 40 can be replaced by more complex circuitry for performing a statistical aggregation of the individual confidence levels in hardware.
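The two-bit trinary variant mentioned above can be sketched as a small aggregation function. The bit patterns chosen for the three states below are an assumption made for illustration; the text does not specify an encoding.

```python
# Sketch of the two-bit "trinary" global state: distinguishes all-ones,
# all-zeros, and mixed local $LC states. The encodings are hypothetical.

ALL_ONE, ALL_ZERO, MIXED = 0b01, 0b00, 0b10

def aggregate_trinary(local_states):
    if all(s == 1 for s in local_states):
        return ALL_ONE
    if all(s == 0 for s in local_states):
        return ALL_ZERO
    return MIXED
```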
As previously mentioned, in some embodiments multiple instances of the chip 2 can be connected together to form an even larger matrix of blocks 4 spanning multiple chips 2. This is illustrated in Figure 12. Some or all of the chips 2 can be implemented in the same IC package, or some or all of the chips 2 can be implemented in different IC packages. The chips 2 are connected to each other by an external interconnection 72 (via the external links 8 shown in FIG. 7). In addition to providing a conduit for exchanging data between blocks 4 on different chips, the external interconnection 72 also provides hardware support for achieving barrier synchronization between blocks 4 on different chips 2 and for aggregating the local output states of the blocks 4 on the different chips 2.
In certain embodiments, the SYNC instruction can take at least one other possible value for its mode operand to specify an external synchronization, i.e. inter-chip: SYNC zone_n, where zone_n represents an external synchronization zone. The external interconnection 72 includes hardware logic similar to that described in relation to FIG. 11, but at an external, inter-chip scale. When the SYNC instruction is executed with an external synchronization zone of two or more chips specified in its operand, this causes the logic in the external interconnection to operate in a similar manner to that described in relation to the internal interconnection 34, but across all the blocks on the multiple different chips 2 in the specified synchronization zone.
That is, in response to an external SYNC, supervisor instruction issue is paused until all the blocks 4 on all the chips 2 in the external synchronization zone have completed their calculation phase 52 and have submitted a synchronization request. In addition, the logic in the external interconnection 72 aggregates the local output states of all these blocks 4 across the multiple chips 2 in the zone in question. Once all the blocks 4 in the external synchronization zone have made the synchronization request, the external interconnection 72 sends a synchronization acknowledgment signal back to the blocks 4 and stores the cross-chip globally aggregated output state in the global consensus registers ($GC) 42 of all the blocks 4 in question. In response to the synchronization acknowledgment, the blocks 4 on all the chips 2 in the zone resume supervisor instruction issue.
[0115] In embodiments, the functionality of the interconnection 72 can be implemented in the chips 2, that is to say that the logic can be distributed among the chips so that only wired connections between chips are required (Figures 11 and 12 are schematic in this respect).
All the blocks 4 in the synchronization zone in question are programmed to indicate the same synchronization zone via the mode operand of their respective SYNC instructions. In embodiments, the synchronization logic in the external interconnection 72 is arranged so that, if this is not the case due to a programming error or some other error (such as a memory parity error), some or all of the blocks 4 will not receive an acknowledgment, and therefore the system will come to a halt at the next external barrier, thus allowing an external management CPU (for example the host) to intervene for debugging or system recovery. In other embodiments an error is raised in the event that the synchronization zones do not match. In any case, the compiler is preferably arranged to ensure that the blocks in the same zone all indicate the same, correct synchronization zone at the relevant time.
FIG. 13 illustrates an example of a BSP program flow involving both internal (on-chip) and external (inter-chip) synchronization. As shown, it is preferable to keep the internal exchanges 50 (of data between blocks 4 on the same chip 2) separate from the external exchanges 50' (of data between blocks 4 on different chips 2). One reason for this is that a global exchange between multiple chips, bounded by a global synchronization, can be more costly in terms of latency and load-balancing complexity than synchronization and exchange at chip level only. Another possible reason is that data exchange via the internal interconnection 34 (on the chip) can be made time-deterministic, whereas in embodiments data exchange via the external interconnection 72 can be time-non-deterministic. In such scenarios it may be useful to separate internal and external exchanges so that the external synchronization and exchange process does not contaminate the internal synchronization and exchange.
Consequently, to obtain such a separation, in embodiments the program is arranged to carry out a sequence of synchronizations, exchange phases and calculation phases comprising, in the following order: (i) a first calculation phase, then (ii) an internal barrier synchronization 30, then (iii) an internal exchange phase 50, then (iv) an external barrier synchronization 80, then (v) an external exchange phase 50'. See the chip 2II in FIG. 13. The external barrier 80 is imposed after the internal exchange phase 50, so that the program does not carry out the external exchange 50' until after the internal exchange 50. It will also be noted that, as shown with regard to the chip 2I in FIG. 12, optionally a calculation phase can be included between the internal exchange (iii) and the external barrier (iv). The overall sequence is imposed by the program (for example by being generated as such by the compiler), and the internal synchronization and exchange do not extend to any blocks or other entities on another chip 2. The sequence (i)-(v) (with the optional calculation phase mentioned above between iii and iv) can be repeated in a series of overall iterations. Per iteration there can be multiple instances of internal calculation, synchronization and exchange (i)-(iii) before the external synchronization and exchange.
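The imposed ordering (i)-(v) can be sketched as a fixed phase schedule. This is an illustrative encoding of the sequence only; the phase names and the `overall_iteration` function are invented for the sketch.

```python
# Sketch of the compiler-imposed phase order (i)-(v) for one overall
# iteration: internal barrier and exchange strictly precede the external
# barrier and exchange. Phase names are illustrative placeholders.

def overall_iteration(trace):
    trace.append("compute")            # (i)   first calculation phase 52
    trace.append("internal_barrier")   # (ii)  internal barrier sync 30
    trace.append("internal_exchange")  # (iii) internal exchange phase 50
    trace.append("external_barrier")   # (iv)  external barrier sync 80
    trace.append("external_exchange")  # (v)   external exchange phase 50'
    return trace

phases = overall_iteration([])
```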
[0119] It will be noted that during an external exchange 50' the communications are not limited to being only external: certain blocks can carry out only internal exchanges, some can carry out only external exchanges, and some can carry out a mixture of the two. It will also be noted that, as shown in FIG. 13, it is generally possible to have a null calculation phase 52 or a null exchange phase 50 in any given BSP superstep.
In some embodiments, as also shown in FIG. 13, certain blocks 4 can perform local input/output during a calculation phase; for example, they can exchange data with a host.
As illustrated in FIG. 14, in embodiments the mode of the SYNC instruction can be used to specify one of multiple different possible external synchronization zones, for example zone_1 or zone_2. In embodiments, these correspond to different hierarchical levels. That is, each higher hierarchical level 92 (for example zone 2) encompasses two or more zones 91A, 91B of at least one lower hierarchical level. In some embodiments there are only two hierarchical levels, but higher numbers of nested levels are not excluded. If the operand of the SYNC instruction is set to the lower hierarchical level of external synchronization zone (SYNC zone_1), then the synchronization and aggregation operations described above are carried out in relation to the blocks 4 on the chips 2 only in the same lower-level external synchronization zone as the block on which the SYNC was executed. If, on the other hand, the operand of the SYNC instruction is set to the higher hierarchical level of external synchronization zone (SYNC zone_2), then the synchronization and aggregation operations described above are automatically carried out in relation to all the blocks on all the chips 2 in the same higher-level external synchronization zone as the block on which the SYNC was executed. In some embodiments the highest hierarchical level of synchronization zone encompasses all the blocks, i.e. it is used to achieve global synchronization. When multiple lower-level zones are used, BSP can be imposed internally within the group of blocks 4 on the chip(s) 2 in each zone, but each zone can operate asynchronously with respect to the others until a global synchronization is carried out.
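The zone-selection behavior above can be sketched as a lookup: the SYNC mode determines which set of chips participates. The zone layout below (two level-1 zones 91A, 91B nested in one level-2 zone 92) is a hypothetical example configuration.

```python
# Sketch of hierarchical sync zones: SYNC zone_1 involves only chips in
# the executing block's lower-level zone; SYNC zone_2 involves all chips
# in the enclosing higher-level zone. The layout here is hypothetical.

ZONES_L1 = {"91A": {"chip0", "chip1"}, "91B": {"chip2", "chip3"}}
ZONES_L2 = {"92": {"91A", "91B"}}

def chips_in_sync_group(mode, my_chip):
    if mode == "zone_1":
        for members in ZONES_L1.values():
            if my_chip in members:
                return members            # only the local lower-level zone
    if mode == "zone_2":
        group = set()
        for z1 in ZONES_L2["92"]:
            group |= ZONES_L1[z1]         # every chip in the enclosing zone
        return group
    raise ValueError(f"unknown SYNC mode: {mode}")
```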
It will be noted that in other embodiments the synchronization zones that can be specified by the mode of the SYNC instruction are not limited to being hierarchical in nature. In general, a SYNC instruction can be provided with modes corresponding to any sort of grouping. For example, the modes can allow selection from among only non-hierarchical groups, or from a mixture of hierarchical groupings and one or more non-hierarchical groups (where at least one group is not entirely nested within another). This advantageously gives the programmer or the compiler the flexibility, with minimal code density, to select between different arrangements of internally synchronous groups which are asynchronous with respect to one another.
An example of a mechanism for implementing synchronization between the selected synchronization groups 91, 92 is illustrated in FIG. 16. As illustrated, the external synchronization logic 76 in the external interconnection 72 comprises a respective synchronization block 95 associated with each respective chip 2. Each synchronization block 95 comprises respective gating logic and a respective synchronization aggregator. The gating logic comprises hardware circuitry which connects the chips 2 together in a daisy-chain topology for the purposes of synchronization and output-state aggregation, and which propagates the synchronization and output-state information in the following manner. The synchronization aggregator comprises hardware circuitry arranged to aggregate the synchronization requests (sync_req) and the output states as follows.
The respective synchronization block 95 associated with each chip 2 is connected to its respective chip 2 so that it can detect the synchronization request (sync_req) raised by that chip 2 and the output state of that chip 2, and so that it can return the synchronization acknowledgment (sync_ack) and the global output state to the respective chip 2. The respective synchronization block 95 associated with each chip 2 is also connected to the synchronization block 95 of at least one other of the chips 2 via an external synchronization interface comprising a bundle of four synchronization wires 96, the details of which will be described shortly. This can be part of one of the chip-to-chip links 8. In the case of a link between chips 2 on different cards, the interface 8 can for example comprise a PCI interface, and the four synchronization wires 96 can be implemented by reusing four wires of the PCI interface. Some of the chips' synchronization blocks 95 are connected to those of two adjacent chips 2, each connection being made via a respective instance of the four synchronization wires 96. In this way, the chips 2 can be connected in one or more chains via their synchronization blocks 95. This allows synchronization requests, synchronization acknowledgments, running aggregates of output state, and global output states to be propagated up and down the chain.
In operation, for each synchronization group 91, 92, the synchronization block 95 associated with one of the chips 2 in that group is set as master for the purposes of synchronization and output-state aggregation, the rest being slaves for this purpose. Each of the slave synchronization blocks 95 is configured with the direction (for example left or right) in which it needs to propagate synchronization requests, synchronization acknowledgments and output states for each synchronization group 91, 92 (i.e. the direction toward the master). In certain embodiments these settings are configurable by software, for example in an initial configuration phase, after which the configuration remains fixed throughout subsequent operation of the system. For example, this can be configured by the host processor. As a variant, it is not excluded that the configuration may be hard-wired. In any case, the different synchronization groups 91, 92 can have different masters, and in general a given chip 2 (or rather its synchronization block 95) can be master of one group and not of another group of which it is a member, or can be master of multiple groups.
For example, consider the example scenario of FIG. 16. Say, by way of example, that the synchronization block 95 of the chip 2IV is set as master of a given synchronization group 91A. Consider the first chip 2I in the chain of chips 2, connected via their synchronization blocks 95 and wires 96 ultimately to the chip 2IV. When all the work threads of the current calculation phase on the first chip have executed an EXIT instruction, and the supervisors on all its blocks 4 have all executed a SYNC instruction specifying the synchronization group 91A, the first chip signals that it is ready for synchronization to its respective associated synchronization block 95. The chip also provides its chip-level aggregated output state to its respective synchronization block 95.
In response, the synchronization block 95 of the first chip propagates a synchronization request (sync_req) to the synchronization block 95 of the next chip 2II in the chain.
It also propagates the output state of the first chip to that next synchronization block 95.
The synchronization block 95 of the chip 2II waits until the supervisors of its own blocks 4 have all executed a SYNC instruction specifying the synchronization group 91A, causing the second chip 2II to indicate that it is ready for synchronization. Only then does the synchronization block 95 of the chip 2II propagate a synchronization request to the synchronization block 95 of the next chip in the chain, and also propagate a running aggregate of the output state of the first chip 2I with that of the second chip 2II. If the second chip 2II had become ready for synchronization before the first chip 2I, then the synchronization block 95 of the second chip 2II would have waited for the first chip 2I to signal a synchronization request before propagating the synchronization request to the synchronization block 95 of the third chip 2III. The synchronization block 95 of the third chip 2III behaves in a similar manner, this time aggregating the running aggregate output state from the second chip 2II with its own to obtain the next running aggregate to pass onward, and so on. This continues toward the master synchronization block, that of the chip 2IV in this example.
The synchronization block 95 of the master then determines a global aggregate of all the output states on the basis of the running aggregate it receives and the output state of its own chip 2IV. It propagates this global aggregate back along the chain to all the chips 2, accompanied by the synchronization acknowledgment (sync_ack).
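The chain propagation just walked through can be sketched end to end: requests flow toward the master carrying a running AND of chip output states, and the master returns sync_ack with the global aggregate. The `chain_sync` function below is an illustrative model (it also covers a master partway along the chain, discussed next), not the gating-logic circuit.

```python
# Sketch of daisy-chain aggregation: sync blocks forward sync_req toward
# the master while AND-ing a running aggregate of chip output states; the
# master then returns sync_ack plus the global aggregate down the chain.

def chain_sync(chip_output_states, master_index):
    # running aggregates arriving at the master from each side of the chain
    left = all(chip_output_states[:master_index])
    right = all(chip_output_states[master_index + 1:])
    own = bool(chip_output_states[master_index])
    global_agg = int(left and right and own)
    # the master propagates sync_ack and the global aggregate to every chip
    return [("sync_ack", global_agg)] * len(chip_output_states)

acks = chain_sync([1, 1, 1, 1], master_index=3)
```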
[0128] If the master is partway along a chain, as opposed to being at one end as in the example above, then the synchronization and output-state information propagates in opposite directions on either side of the master, both sides toward the master. In this case, the master only issues the synchronization acknowledgment and the global output state once the synchronization request has been received from both sides. Consider for example the case where the chip 2III is master of the group 92. Furthermore, in embodiments the synchronization block 95 of some of the chips 2 could be connected to those of three or more other chips 2, thus creating multiple branches of chains toward the master. Each chain then behaves as described above, and the master only issues the synchronization acknowledgment and the global output state once the synchronization request has been received from all the chains. And/or, one or more of the chips 2 could be connected to an external resource such as the host processor, a network card, a storage device, or an FPGA.
In embodiments the synchronization signaling is implemented as follows. The bundle of four synchronization wires 96 between each pair of chips comprises two pairs of wires, a first pair 96_0 and a second pair 96_1. Each pair comprises an instance of a synchronization request wire and an instance of a synchronization acknowledgment wire. To signal a synchronization request with a running aggregate output state of value 0, the synchronization block 95 of the sending chip 2 uses the synchronization request wire of the first pair of wires 96_0 when signaling the request (sync_req), whereas to signal a running aggregate of value 1 it uses the synchronization request wire of the second pair of wires 96_1 when signaling the synchronization request. To signal a synchronization acknowledgment with a global aggregate output state of value 0, the synchronization block 95 of the sending chip 2 uses the synchronization acknowledgment wire of the first pair of wires 96_0 when signaling the acknowledgment (sync_ack), whereas to signal a global aggregate of value 1 it uses the synchronization acknowledgment wire of the second pair of wires 96_1 when signaling the synchronization acknowledgment.
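In other words, which pair of wires carries the pulse encodes the one-bit aggregate riding on the signal. A minimal sketch of this encoding, with hypothetical function names:

```python
# Sketch of the four-wire signaling scheme: the wire pair chosen (96_0 or
# 96_1) encodes the one-bit aggregate accompanying the sync signal.

def encode_signal(kind, aggregate_bit):
    # kind is "sync_req" (toward the master) or "sync_ack" (returning)
    pair = "96_1" if aggregate_bit == 1 else "96_0"
    return (pair, kind)

def decode_signal(pair, kind):
    return (kind, 1 if pair == "96_1" else 0)

sig = encode_signal("sync_req", 1)
```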
[0130] It will be noted that the above is only the mechanism for propagating the synchronization and output-state information. The actual data (content) is transmitted by another channel, for example as described below with reference to FIG. 16. In addition, it will be noted that this is only one example implementation, and the skilled person will be able to construct other circuits implementing the described synchronization and aggregation functionality once given the specification of that functionality set out here. For example, the synchronization logic (95 in Figure 18) could instead use packets transported over the interconnection 34, 72 as an alternative to dedicated wiring. For example, the sync_req and/or the sync_ack could each be transmitted as one or more packets.
The functionality of the SYNC instruction in the various possible modes is summarized as follows.
SYNC tile (performs a local barrier synchronization on a tile):
• the supervisor's execution mode passes from running to waiting for the worker threads to exit;
• instruction issue for the supervisor thread is suspended until all worker threads are inactive;
• when all worker threads are inactive, the aggregated worker exit state is made available through the local consensus register ($LC) 38.
SYNC chip (performs an internal, on-chip barrier synchronization):
• the supervisor's execution mode passes from running to waiting for the worker threads to exit;
• instruction issue for the supervisor thread is suspended until all worker threads are inactive;
• when all worker threads are inactive:
- the aggregated local worker exit state is made available through the local consensus register ($LC) 38;
- participation in the internal synchronization is signaled to the exchange fabric 34;
- the supervisor remains suspended until the tile 4 receives an internal synchronization acknowledgment from the exchange fabric 34;
- the system-level exit state is updated in the global consensus register ($GC) 42.
SYNC zone_n (performs an external barrier synchronization across the whole of zone n):
• the supervisor's execution mode passes from running to waiting for the worker threads to exit;
• instruction issue for the supervisor thread is suspended until all worker threads are inactive;
• when all worker threads are inactive:
- the aggregated local worker exit state is made available through the local consensus register ($LC) 38;
- participation in the external synchronization is signaled to the external system, for example the synchronization logic in the above-mentioned external interconnection 72;
- the supervisor remains suspended until the tile 4 receives an external synchronization acknowledgment from the external system 72;
- the system-level exit state is updated in the global consensus register ($GC) 42.
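The supervisor-side sequence shared by the three modes can be modeled in a few lines. This toy sketch is an assumption-laden illustration, not the hardware behavior: the class names, the pre-collected worker exit states, and the callable standing in for the interconnect's sync logic are all hypothetical.

```python
# Toy model of the SYNC sequence: wait for workers, AND-aggregate into
# the local consensus register $LC, then (for chip / zone_n modes) join
# an external barrier that yields the global consensus register $GC.

class Tile:
    def __init__(self):
        self.worker_exit_states = []
        self.lc = None   # stands in for local consensus register ($LC) 38
        self.gc = None   # stands in for global consensus register ($GC) 42

    def sync(self, mode, external_barrier=None):
        # 1. Supervisor is suspended until every worker has executed EXIT;
        #    here the worker exit states are assumed already collected.
        self.lc = all(self.worker_exit_states)   # aggregate via AND
        if mode == "tile":
            return self.lc
        # 2. chip / zone_n: signal participation, wait for the sync ack,
        #    then latch the system-level aggregate into $GC.
        self.gc = external_barrier(self.lc)
        return self.gc

def make_barrier(other_tiles_states):
    """Stand-in for the sync logic in the interconnect: ANDs the $LC
    values of the other participating tiles with the caller's own."""
    def barrier(local_state):
        return all(other_tiles_states) and local_state
    return barrier
```

The key point the model captures is that SYNC tile stops at step 1, while the chip and zone_n modes additionally block on the external acknowledgment before $GC becomes valid.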
FIG. 15 illustrates an example of application of the processor architecture described here, namely an artificial intelligence application.
As is well known to those skilled in the art of artificial intelligence, artificial intelligence begins with a learning step in which the artificial intelligence algorithm learns a knowledge model. The model comprises a graph of interconnected nodes 102 (that is, vertices) and edges 104 (that is, links). Each node 102 in the graph has one or more input edges and one or more output edges. Some of the input edges of some of the nodes 102 are the output edges of some of the other nodes, thereby connecting the nodes to one another to form the graph. In addition, one or more of the input edges of one or more of the nodes 102 form the inputs of the graph as a whole, and one or more of the output edges of one or more of the nodes 102 form the outputs of the graph as a whole. Sometimes a given node may even have all of these: inputs of the graph, outputs of the graph, and connections to other nodes. Each edge 104 communicates a value or more often a tensor (n-dimensional matrix), these forming the inputs and outputs provided to and obtained from the nodes 102 on their input and output edges respectively.
Each node 102 represents a function of its one or more inputs received on its input edge(s), the result of this function being the output(s) provided on its output edge(s). Each function is parameterized by one or more respective parameters (sometimes called weights, although they need not necessarily be multiplicative weights). In general, the functions represented by the different nodes 102 may take different forms of function and/or may be parameterized by different parameters.
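A node of this kind can be sketched as a small class. The class name and the choice of a weighted-sum node function are illustrative assumptions; the text above deliberately leaves the form of the function open.

```python
# Minimal illustrative sketch of a node 102: a function of the values on
# its input edges, parameterized by per-node parameters, producing a
# value on its output edge. The weighted sum is one possible function.

class Node:
    def __init__(self, params):
        self.params = params          # the node's respective parameters

    def apply(self, inputs):
        # One possible node function: weighted sum of the input edges.
        return sum(w * x for w, x in zip(self.params, inputs))

n = Node([0.5, 2.0])
out = n.apply([4.0, 1.0])            # value carried on the output edge
```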
In addition, each of said one or more parameters of each node's function is characterized by a respective error value. Moreover, a respective condition may be associated with the error(s) in the parameter(s) of each node 102. For a node 102 representing a function parameterized by a single parameter, the condition may be a simple threshold, that is, the condition is satisfied if the error falls within the specified threshold but is not satisfied if the error is beyond the threshold.
For a node 102 parameterized by more than one respective parameter, the condition for this node 102 to have reached an acceptable level of error may be more complex. For example, the condition may be satisfied only if each of the parameters of this node 102 remains below its respective threshold. As another example, a combined metric may be defined combining the errors in the different parameters of the same node 102, and the condition may be satisfied if the value of the combined metric remains below a specified threshold, but is not satisfied if the value of the combined metric is above the threshold (or vice versa depending on the definition of the metric). Whatever the condition, this gives a measure of whether the error in the node's parameter(s) falls below a certain level or degree of acceptability. In general, any suitable metric may be used. The condition or the metric may be the same for all the nodes, or may differ between certain respective nodes.
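The two condition styles just described can be written down directly. The per-parameter thresholds and the choice of root-mean-square as the combined metric are illustrative; the text only requires "any suitable metric".

```python
# Sketch of the two condition styles: (a) every parameter error within
# its own threshold, or (b) a combined metric below a single threshold.

def per_param_condition(errors, thresholds):
    """Satisfied only if every parameter error is within its threshold."""
    return all(e <= t for e, t in zip(errors, thresholds))

def combined_metric_condition(errors, threshold):
    """Satisfied if a combined (here: root-mean-square) metric of the
    parameter errors stays below the specified threshold."""
    rms = (sum(e * e for e in errors) / len(errors)) ** 0.5
    return rms < threshold
```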
In the learning step, the algorithm receives experience data, that is, multiple data points representing different possible combinations of inputs to the graph. As more experience data is received, the algorithm gradually adjusts the parameters of the various nodes 102 of the graph on the basis of the experience data so as to try to minimize the errors in the parameters. The goal is to find values of the parameters such that the output of the graph is as close as possible to a desired output for a given input. When the graph as a whole tends towards such a state, the graph is said to converge. After a suitable degree of convergence, the graph can be used to make predictions or inferences, that is, to predict an output for a given input or to infer a cause for a given output.
The learning step can take a number of different possible forms. For example, in a supervised approach, the input experience data takes the form of training data, that is, inputs which correspond to known outputs. With each data point, the algorithm can adjust the parameters so that the output more closely matches the known output for the given input. In the subsequent prediction stage, the graph can then be used to map an input query to an approximate predicted output (or vice versa if an inference is made). Other approaches are also possible. For example, in an unsupervised approach, there is no concept of a reference result per input datum, and instead the artificial intelligence algorithm is left to identify its own structure in the output data. Or, in a reinforcement approach, the algorithm tries out at least one possible output for each data point in the input experience data, and is told whether this output is positive or negative (and potentially the degree to which it is positive or negative), for example win or lose, reward or punishment, or the like. Over many trials, the algorithm can gradually adjust the parameters of the graph to be able to predict inputs which will lead to a positive output. The various approaches and algorithms for learning a graph are known to those skilled in the art of artificial intelligence.
According to an example application of the techniques described here, each worker thread is programmed to perform the computations associated with a respective individual one of the nodes 102 in an artificial intelligence graph. In this case, at least some of the edges 104 between the nodes 102 correspond to exchanges of data between threads, and some may involve exchanges between tiles. In addition, the individual exit states of the worker threads are used by the programmer to represent whether or not the respective node 102 has satisfied its respective condition for the convergence of the parameter(s) of this node, that is, whether the error in the parameter(s) remains within the acceptable level or region of error space. For example, this is one example use of the embodiments in which each of the individual exit states is an individual bit and the aggregated exit state is an AND of the individual exit states (or equivalently an OR if 0 is taken as positive); or in which the aggregated exit state is a trinary value representing whether the individual exit states were all true, all false, or mixed. Thus, by examining a single register value in the exit state register 38, the program can determine whether the graph as a whole, or at least a sub-region of the graph, has converged to an acceptable degree.
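The two aggregate encodings mentioned above behave as follows. The string labels for the trinary case are an assumed representation; in hardware the trinary value would be two bits.

```python
# Sketch of the two aggregate encodings: a single-bit AND of the
# individual exit states, and a trinary value distinguishing
# all-true / all-false / mixed.

def aggregate_and(states):
    """Single-bit aggregate: 1 only if every individual state is 1."""
    return all(states)

def aggregate_trinary(states):
    """Trinary aggregate: all_true, all_false, or mixed."""
    if all(states):
        return "all_true"
    if not any(states):
        return "all_false"
    return "mixed"
```

Reading a single aggregate, rather than scanning every worker's state, is what makes the one-register convergence check described above cheap.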
In another variant of this, embodiments may be used in which the aggregation takes the form of a statistical aggregation of individual confidence values. In this case, each individual exit state represents a confidence (for example a percentage) that the parameters of the node represented by the respective thread have reached an acceptable degree of error. The aggregated exit state can then be used to determine a global degree of confidence indicating whether the graph, or a sub-region of the graph, has converged to an acceptable degree.
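The statistical variant might look like the following. The choice of the mean as the summary statistic is an assumption; the text does not fix which statistic is used.

```python
# Sketch of the statistical variant: each worker reports a confidence
# (e.g. a percentage) that its node's parameters have reached an
# acceptable error, and the aggregate is a summary statistic.

def aggregate_confidence(confidences):
    """One possible statistical aggregate: the mean confidence."""
    return sum(confidences) / len(confidences)
```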
In the case of a multi-tile arrangement 6, each tile executes a subgraph of the graph. Each subgraph comprises a supervisor routine comprising one or more supervisor threads, and a set of worker threads, some or all of which may take the form of codelets.
In such applications, or indeed in any graph-based application where each worker thread is used to represent a respective node of a graph, the codelet that each worker thread comprises may be defined as a software procedure acting on the persistent state and the inputs and/or outputs of a vertex, wherein the codelet:
• is launched on a worker thread register context, to execute in a barrel slot, by the supervisor thread executing a run instruction;
• runs to completion without communication with other codelets or the supervisor (except for the return to the supervisor when the codelet exits);
• has access to the persistent state of a vertex via a memory pointer provided by the run instruction, and to a non-persistent working area in memory which is private to this barrel slot; and
• executes an EXIT as its last instruction, whereupon the barrel slot it was using is returned to the supervisor, and the exit state specified by the exit instruction is aggregated with the local exit state of the tile, which is visible to the supervisor.
Updating a graph (or a sub-graph) means updating each constituent vertex once, in any order consistent with the causality defined by the edges. Updating a vertex means running a codelet on the vertex state. A codelet is an update procedure for vertices; a codelet is usually associated with many vertices. The supervisor executes one RUN instruction per vertex, each such instruction specifying a vertex state address and a codelet address.
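The RUN-per-vertex update rule just described can be sketched as a software loop. This is a toy model with assumed names; in particular, the example codelet (halving an error value and reporting convergence below a threshold) is hypothetical and stands in for whatever per-vertex computation a real application performs.

```python
# Toy sketch of the update rule: the supervisor issues one RUN per
# vertex, each RUN naming a vertex state and a codelet; each codelet
# ends with EXIT, whose operand is folded into the tile's local exit
# state via AND.

def run_codelet(codelet, vertex_state):
    new_state, exit_state = codelet(vertex_state)
    return new_state, exit_state

def supervisor_update(vertices, codelet):
    local_consensus = True                 # $LC starts all-true
    for v_id, state in vertices.items():
        vertices[v_id], exit_state = run_codelet(codelet, state)
        local_consensus = local_consensus and exit_state
    return local_consensus

# Example (hypothetical) codelet: halve the vertex error and report
# convergence once the error drops below 0.1.
def codelet(state):
    new_error = state["error"] * 0.5
    return {"error": new_error}, new_error < 0.1
```

Each call to `supervisor_update` is one graph update; the returned value models the aggregated exit state the supervisor would read from $LC after the workers exit.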
Note that the above embodiments have been described only by way of example.
For example, the applicability of the exit-state aggregation mechanism is not limited to the architecture described above, in which a separate context is provided for the supervisor thread, or in which the supervisor thread runs in a time slot and then relinquishes its time slot to a worker thread. In another arrangement, for example, the supervisor may run in its own dedicated slot.
Furthermore, the terms supervisor and worker thread do not imply specific responsibilities unless explicitly stated, and in particular are not necessarily limited in themselves to the scheme described above in which a supervisor thread relinquishes its time slot to a worker thread and so forth. In general, a worker thread may denote any thread to which a computational task is allocated. The supervisor may represent any kind of supervising or coordinating thread responsible for actions such as: assigning worker threads to barrel slots, and/or performing barrier synchronizations between multiple threads, and/or performing any flow-control operation (such as a branch) depending on the outcome of more than one thread.
When reference is made to a sequence of interleaved time slots, or the like, this does not necessarily imply that the sequence mentioned consists of all the possible or available slots. For example, the sequence in question could consist of all possible slots or only those which are currently active. It is not necessarily excluded that there may be other potential slots which are not currently included in the planned sequence.
The term tile as used here is not necessarily limited to any particular topography or the like, and in general may denote any modular unit of processing resources comprising a processing unit 10 and a corresponding memory, in an array of like modules, at least some of which are typically on the same chip (that is, the same die).
In addition, the scope of the present description is not limited to a time-deterministic internal interconnection or a non-time-deterministic external interconnection. The synchronization and aggregation mechanisms described here may also be used in a completely time-deterministic arrangement, or in a completely non-time-deterministic arrangement.
In addition, when reference is made to performing synchronization or aggregation across a group of tiles, or across a plurality of tiles or the like, this need not necessarily refer to all the tiles on the chip or all the tiles in the system unless explicitly stated. For example, the SYNC and EXIT instructions could be arranged to perform synchronization and aggregation only in relation to a certain subset of tiles 4 on a given chip and/or only a subset of chips 2 in a given system; while certain other tiles 4 on a given chip, and/or certain other chips in a given system, may not be involved in a given BSP group, and could even be used for a completely separate set of tasks unrelated to the computation being performed by the group at hand.
Also, while certain SYNC instruction modes have been described here, the scope of the present description more generally is not limited to such modes. For example, the list of modes given above is not necessarily exhaustive. Or, in other embodiments, the SYNC instruction may have fewer modes; for example, the SYNC need not support different hierarchical levels of external synchronization, or need not distinguish between on-chip and inter-chip synchronizations (that is, in an inter-tile mode, it always acts in relation to all tiles whether on-chip or off-chip). In yet other alternative embodiments, the SYNC instruction need not take a mode as an operand at all. For example, in some embodiments separate versions of the SYNC instruction (different operation codes) may be provided for the different levels of synchronization and exit-state aggregation (such as different SYNC instructions for on-tile and inter-tile synchronization, or for on-chip and inter-chip synchronization). Or, in other embodiments, a dedicated SYNC instruction may be provided only for inter-tile synchronizations (leaving tile-level synchronization between threads, where necessary, to be carried out in general-purpose software).
In addition, the synchronization zones are not limited to being hierarchical (that is, nested one within another), and in other embodiments the selectable synchronization zones may consist of or include one or more non-hierarchical groups (all the tiles of such a group not being nested within a single other selectable group).
In addition, the synchronization schemes described here do not exclude the involvement, in embodiments, of external resources other than multi-tile processors, for example a CPU processor such as the host processor, or even one or more components which are not processors, such as one or more network cards, storage devices and/or FPGAs. For example, certain tiles may choose to engage in data transfers with an external system, these transfers forming the computational burden of that tile. In this case, the transfers should be completed before the next barrier. In some cases, the exit state of the tile may depend on the result of the communication with the external resource, and this resource may indirectly influence the exit state. Alternatively or in addition, resources other than multi-tile processors, for example the host or one or more FPGAs, could be incorporated into the synchronization network itself. That is, a synchronization signal such as a sync_req is required from this or these additional resources in order for the synchronization barrier to be satisfied and for the tiles to proceed to the next exchange phase. Furthermore, in embodiments the aggregated global exit state may include in the aggregation an exit state of the external resource, for example from an FPGA.
Other applications and variants of the described techniques may become apparent to those skilled in the art given the description herein. The scope of the present description is not limited by the described embodiments but only by the appended claims.
Claims (19)
1. A processor comprising:
an execution pipeline comprising an execution unit for executing machine code instructions, each being an instance of a predefined set of instruction types in an instruction set of the processor, each instruction type in the instruction set being defined by a corresponding operation code and by zero or more operand fields for taking zero or more operands;
multiple sets of context registers;
a scheduler arranged to control the execution pipeline to provide a repeating sequence of temporally interleaved time slots, thereby enabling at least one respective worker thread to be allocated for execution in each respective slot of some or all of the time slots, wherein a program state of the respective worker thread executing in each time slot is maintained in a respective one of the sets of context registers; and
an exit state register arranged to store an aggregated exit state of the worker threads;
wherein the instruction set includes an exit instruction to be included in each of the worker threads, the exit instruction taking at least an individual exit state of the respective thread as an operand; and
wherein the execution unit comprises dedicated hardware logic arranged, in response to the operation code of the exit instruction, to terminate execution of the respective worker thread in its respective time slot, and also to cause the individual exit state specified in the operand to contribute to the aggregated exit state in the exit state register.
2. The processor according to claim 1, wherein the exit instruction comprises a single operand field taking a single operand in the form of the individual exit state.
3. The processor according to claim 1 or 2, wherein each of the individual exit states and the aggregated exit state is only a single bit.
4. The processor according to claim 3, wherein the aggregation consists of a Boolean AND of the individual exit states, or a Boolean OR of the individual exit states.
5. The processor according to claim 1 or 2, wherein the aggregated exit state comprises at least two bits representing a trinary value, indicating whether the individual binary exit states are all 1, all 0, or mixed.
6. The processor according to any one of the preceding claims, further arranged, in one or more of the time slots during at least some repetitions of said sequence of time slots, to execute a supervisor thread which allocates the worker threads to their respective execution time slots.
7. The processor according to claim 6, wherein the multiple sets of context registers comprise multiple sets of worker context registers, each set of worker context registers being arranged to maintain the program state of the respective worker thread running in the respective time slot while the respective worker thread is executing, and an additional set of supervisor context registers comprising additional registers arranged to store a program state of the supervisor thread.
8. The processor according to claim 6 or 7, wherein the supervisor thread begins running in each of the plurality of time slots, then relinquishes some or all of the time slots to the respective worker threads; and wherein the exit instruction causes the supervisor thread to resume execution in the respective time slot of the worker thread that executed the exit instruction.
9. The processor according to claim 8, wherein the instruction set further comprises a relinquish instruction, and the execution unit is arranged to perform the relinquishing of the respective execution slot in response to the operation code of the relinquish instruction executed by the supervisor thread in the respective time slot being relinquished.
10. The processor according to claim 1, wherein the processor comprises a group of tiles, each of which comprises an instance of the execution unit, of the multiple contexts, of the scheduler, and of the exit state register; and wherein the processor further comprises an interconnect for communicating between the tiles.
11. The processor according to claim 10, wherein the interconnect comprises dedicated hardware logic arranged to automatically aggregate the aggregated exit states from the group of tiles into a global aggregate, and to make the global aggregate available to at least one of the threads on each of the tiles.
12. The processor according to claim 11 and any one of claims 6 to 9, wherein said at least one thread comprises the supervisor thread.
13. The processor according to claim 11 or 12, wherein each of the tiles further comprises a global aggregate register arranged to be readable by said at least one thread on that tile; and wherein the logic in the interconnect is arranged to automatically make the global aggregate available to said at least one thread on each tile by automatically storing the global aggregate in the global aggregate register on each tile.
14. The processor according to any one of claims 10 to 13, wherein the interconnect comprises a synchronization controller operable to apply a bulk synchronous parallel exchange scheme to the communications between tiles, whereby when each of the tiles is programmed to perform an inter-tile exchange phase and an on-tile compute phase, then either a) the exchange phase is held back until all the worker threads on all the tiles in the group have completed the compute phase, or b) the compute phase is held back until all the tiles in the group have completed the exchange phase.
15. The processor according to claim 14, wherein: the instruction set further comprises a barrier synchronization instruction to be included in one of the threads on each of the tiles following (a) the compute phase or (b) the exchange phase, respectively;
on each tile, the execution unit is arranged, upon execution of the barrier synchronization instruction, to send a synchronization request signal to the synchronization controller in the interconnect; and the synchronization controller is arranged to return a synchronization acknowledgment signal to each of the tiles in response to receiving an instance of the synchronization request signal from all of the tiles, receipt of the synchronization acknowledgment signal releasing the next (a) exchange phase or (b) compute phase, accordingly.
16. The processor according to claim 14 or 15, and any one of claims 6 to 9, wherein the exchange phase is arranged to be performed by the supervisor thread.
17. The processor according to any one of the preceding claims, programmed to perform an artificial intelligence algorithm in which each node of a graph comprises one or more respective input edges and one or more respective output edges, the input edges of at least some of the nodes being the output edges of at least some others of the nodes, each node comprising a respective function relating its output edges to its input edges, each respective function being parameterized by one or more respective parameters, and each of the respective parameters having an associated error, such that the graph converges towards a solution as the errors in some or all of the parameters are reduced;
wherein each of the worker threads models a respective one of the nodes of the graph, and each of the individual exit states is used to indicate whether the errors in said one or more parameters of the respective node have satisfied a predetermined condition.
18. A method of operating a processor comprising an execution pipeline and multiple sets of context registers, the execution pipeline comprising an execution unit for executing machine code instructions, each being an instance of a predefined set of instruction types in an instruction set of the processor, each instruction type in the instruction set being defined by a corresponding operation code and by zero or more operand fields for taking zero or more operands; wherein the method comprises:
scheduling the execution pipeline to provide a repeating sequence of temporally interleaved time slots, thereby enabling at least one respective worker thread to be allocated for execution in each respective slot of some or all of the time slots, wherein a program state of the respective worker thread executing in each slot is maintained in a respective one of the sets of context registers; and
at least temporarily maintaining an aggregated exit state of the worker threads in an exit state register of the processor;
wherein the instruction set includes an exit instruction which is included in each of the worker threads, the exit instruction taking at least an individual exit state of the respective worker thread as an operand; and
the method comprises, in response to the operation code of the exit instruction being executed, triggering dedicated hardware logic of the processor to terminate execution of the respective worker thread in its respective time slot, and to cause the individual exit state specified in the operand to contribute to the aggregated exit state in the exit state register.
19. A computer program product embodied on computer-readable storage and comprising code arranged to execute on the processor of any one of claims 1 to 17, the code comprising the worker threads, the exit instruction being comprised in each worker thread.